1. 13 Dec, 2011 35 commits
    • Jack Morgenstein's avatar
      mlx4_core: Modify driver initialization flow to accommodate SRIOV for Ethernet · ab9c17a0
      Jack Morgenstein authored
      1. Added module parameters sr_iov and probe_vf for controlling enablement of
         SRIOV mode.
      2. Increased default max num-qps, num-mpts and log_num_macs to accommodate
         SRIOV mode
      3. Added port_type_array as a module parameter to allow driver startup with
         ports configured as desired.
         In SRIOV mode, only ETH is supported, and this array is ignored; otherwise,
         for the case where the FW supports both port types (ETH and IB), the
         port_type_array parameter is used.
         By default, the port_type_array is set to configure both ports as IB.
      4. When running in sriov mode, the master needs to initialize the ICM eq table
         to hold the eq's for itself and also for all the slaves.
      5. mlx4_set_port_mask() now invoked from mlx4_init_hca, instead of in mlx4_dev_cap.
      6. Introduced sriov VF (slave) device startup/teardown logic (mainly procedures
         mlx4_init_slave, mlx4_slave_exit, mlx4_slave_cap, and flow
         modifications in __mlx4_init_one, mlx4_init_hca, and mlx4_setup_hca).
         VFs obtain their startup information from the PF (master) device via the
         comm channel.
      7. In SRIOV mode (both PF and VF), MSI-X must be enabled, or the driver
         aborts loading the device.
      8. Do not allow setting port type via sysfs when running in SRIOV mode.
      9. mlx4_get_ownership:  Currently, only one PF is supported by the driver.
         If the HCA is burned with FW which enables more than one PF, only one
         of the PFs is allowed to run.  The first one up grabs a FW ownership
         semaphore -- all other PFs will find that semaphore taken, and the
         driver will not allow them to run.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
      Signed-off-by: Liran Liss <liranl@mellanox.co.il>
      Signed-off-by: Marcel Apfelbaum <marcela@mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jack Morgenstein's avatar
      mlx4_core: adjust catas operation for SRIOV mode · d81c7186
      Jack Morgenstein authored
      When running in SRIOV mode, driver should not automatically start/stop
      the mlx4_core upon sensing an HCA internal error -- doing this disables/enables
      sriov, which will cause the hypervisor to hang if there are running VMs with
      attached VFs.
      
      In addition, on VMs the catas process should not run at all, since the HCA
      error buffer is not available to VMs in the BARs.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Marcel Apfelbaum's avatar
      mlx4_core: mtts resources units changed to offset · 2b8fb286
      Marcel Apfelbaum authored
      In the previous implementation mtts are managed by:
      1. order     - log(mtt segments), 'mtt segment' groups several mtts together.
      2. first_seg - segment location relative to mtt table.
      In the current implementation:
      1. order     - log(mtts) rather than segments
      2. offset    - mtt index in mtt table
      
      Note: The actual mtt allocation is made in segments but it is
            transparent to callers.
      
      Rationale: The mtt resource holders are not interested in how the allocation
                 of mtts is done, but rather in how they will use them.
      Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
      Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
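
      As a rough illustration of the order/offset units described in the mtt
      commit above, the following C sketch converts between mtt units (what
      callers see) and segment units (what the allocator uses). The constant
      LOG_MTTS_PER_SEG, its value, and both helper names are assumptions made
      for this example; they are not taken from the driver.

```c
#include <stdint.h>

/* Assumption for illustration only: 8 mtts per segment (log2 = 3). */
enum { LOG_MTTS_PER_SEG = 3 };

/* Callers pass log2(number of mtts); the allocator works in segments. */
static unsigned int mtt_order_to_seg_order(unsigned int order)
{
	return order > LOG_MTTS_PER_SEG ? order - LOG_MTTS_PER_SEG : 0;
}

/* The allocator yields a segment index; callers see an mtt index. */
static uint32_t seg_to_mtt_offset(uint32_t first_seg)
{
	return first_seg << LOG_MTTS_PER_SEG;
}
```

      With these assumed values, a request of order 5 (32 mtts) maps to
      segment order 2 (4 segments), and segment 4 starts at mtt offset 32.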
    • Eugenia Emantayev's avatar
      mlx4_en: Allow communication between functions on same host · 5b4c4d36
      Eugenia Emantayev authored
      To enable internal loopback, always fill the DMAC in the control segment
      when transmitting a packet. Once this is done, the packet is subject
      to loopback if the DMAC matches one of the multicast/unicast addresses
      registered on the physical port.
      In the receive path, if the source MAC is our own MAC and we are not in
      selftest, or not in force LB mode, drop the packet.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
      Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eugenia Emantayev's avatar
      mlx4: Ethernet port management modifications · ffe455ad
      Eugenia Emantayev authored
      The physical port is now common to the PF and VFs.
      The port resources and configuration are managed by the PF. VFs can
      only influence the MTU of the port, which is set to the maximum among all
      functions. Each function allocates RX buffers of the size required to
      enforce its own MTU.
      Port management code was moved to mlx4_core, as the mlx4_en module is
      virtualization unaware.
      
      Move handling qp functionality to mlx4_get_eth_qp/mlx4_put_eth_qp
      including reserve/release range and add/release unicast steering.
      Let mlx4_register/unregister_mac deal only with MAC (un)registration.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
      Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eugenia Emantayev's avatar
      mlx4: Traffic steering management support for SRIOV · 0ec2c0f8
      Eugenia Emantayev authored
      Let multicast/unicast attaching flow go through resource tracker.
      The PF is the one responsible for managing all the steering entries.
      Define and use module parameter that determines the number of qps
      per multicast group.
      Minor changes in function calls according to changed prototype.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
      Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jack Morgenstein's avatar
    • Eli Cohen's avatar
      mlx4_core: resource tracking for HCA resources used by guests · c82e9aa0
      Eli Cohen authored
      The resource tracker is used to track usage of HCA resources by the different
      guests.
      
      Virtual functions (VFs) are attached to guest operating systems but
      resources are allocated from the same pool and are assigned to VFs. It is
      essential that hostile/buggy guests not be able to affect the operation of
      other VFs, possibly attached to other guest OSs since ConnectX firmware is not
      tolerant to misuse of resources.
      
      The resource tracker module associates each resource with a VF and maintains
      state information for the allocated object. It also defines allowed state
      transitions and enforces them.
      
      Relationships between resources are also referred to. For example, CQs are
      pointed to by QPs, so it is forbidden to destroy a CQ if a QP refers to it.
      
      ICM memory is always accessible through the primary function and hence it is
      allocated by the owner of the primary function.
      
      When a guest dies, an FLR is generated for all the VFs it owns and all the
      resources it used are freed.
      
      The tracked resource types are: QPs, CQs, SRQs, MPTs, MTTs, MACs, RES_EQs,
      and XRCDNs.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
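
      The ownership, state-transition, and reference rules described in the
      resource tracker commit above can be sketched in a few lines of C. This
      is a toy model under stated assumptions: the two states, the struct
      layout, and the function names are hypothetical and much simpler than
      whatever the driver actually defines.

```c
#include <stdbool.h>

/* Hypothetical, simplified states for one tracked object. */
enum res_state { RES_ALLOCATED, RES_HW };

struct tracked_res {
	int owner;             /* slave (VF) the resource is associated with */
	enum res_state state;  /* current state of the allocated object */
	int ref_count;         /* e.g. number of QPs pointing at this CQ */
};

/* A slave may only operate on resources associated with it. */
static bool res_owned_by(const struct tracked_res *r, int slave)
{
	return r->owner == slave;
}

/* Allowed transition: RES_ALLOCATED -> RES_HW, by the owner only. */
static int res_to_hw(struct tracked_res *r, int slave)
{
	if (!res_owned_by(r, slave) || r->state != RES_ALLOCATED)
		return -1;
	r->state = RES_HW;
	return 0;
}

/* Destroying is refused while another resource still refers to this one;
 * on success the object returns to the software-owned state. */
static int res_destroy(struct tracked_res *r, int slave)
{
	if (!res_owned_by(r, slave) || r->ref_count != 0)
		return -1;
	r->state = RES_ALLOCATED;
	return 0;
}
```

      In this model a CQ with ref_count 1 (one QP still pointing at it)
      cannot be destroyed, and a request from a non-owning slave is rejected
      before any state is changed.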
    • Jack Morgenstein's avatar
      mlx4_core: Add wrapper functions and comm channel and slave event support to EQs · acba2420
      Jack Morgenstein authored
      Passing async events to slaves:
      In SRIOV mode, each slave creates its own async EQ, but only the master can
      register directly with the FW to receive async events.  Async events which
      should be passed to slaves (such as a WQ_ACCESS_ERROR for a QP owned by a slave)
      are generated at the slave by the master using the GEN_EQE FW command.
      
      Wrapper functions: mlx4_MAP_EQ_wrapper
      Only the master can map an EQ. The slave commands to map their EQs arrive
      at the master via the comm channel.  The master then invokes the wrapper
      function to do the work (and enter the resource in the tracking database).
      
      New events: COMM_CHANNEL and FLR
      The COMM_CHANNEL event arrives only at the master, and signals that
      a slave has posted a command on the comm channel.
      The FLR event is generated by the FW when a guest operating a VF
      unexpectedly goes down.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jack Morgenstein's avatar
      mlx4_core: mtt modifications for SRIOV · ea51b377
      Jack Morgenstein authored
      MTTs are resources which are allocated and tracked by the PF driver.
      In multifunction mode, the allocation and icm mapping is done in
      the resource tracker (later patch in this sequence).
      
      To accomplish this, we have "work" functions whose names start with
      "__", and "request" functions (same name, no __). If we are operating
      in multifunction mode, the request function actually results in
      comm-channel commands being sent (ALLOC_RES or FREE_RES).
      The PF-driver comm-channel handler will ultimately invoke the
      "work" (__) function and return the result.
      
      If we are not in multifunction mode, the "work" handler is invoked
      immediately.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
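
      The "work"/"request" split described above (and repeated in the cq, qp,
      and srq commits) might look roughly like the following C sketch. All
      names here (struct demo_dev, demo_res_alloc, demo_comm_alloc_res) are
      invented for the example, and a single struct stands in for both the
      requesting function and the PF, which are separate contexts in the
      driver. The "__" prefix is kept because it is the convention the commit
      describes; such identifiers are reserved in ordinary hosted C.

```c
#include <stdbool.h>

struct demo_dev {
	bool mfunc;     /* true when running in multifunction (SRIOV) mode */
	int next_free;  /* toy allocator state, owned by the PF */
	int comm_cmds;  /* number of commands sent over the comm channel */
};

/* "Work" function: performs the allocation itself. */
static int __demo_res_alloc(struct demo_dev *dev)
{
	return dev->next_free++;
}

/* Stand-in for sending ALLOC_RES over the comm channel: the PF-side
 * handler ends up invoking the "work" function and returns its result. */
static int demo_comm_alloc_res(struct demo_dev *dev)
{
	dev->comm_cmds++;
	return __demo_res_alloc(dev);
}

/* "Request" function: same name, without the "__" prefix. */
static int demo_res_alloc(struct demo_dev *dev)
{
	if (dev->mfunc)
		return demo_comm_alloc_res(dev);
	return __demo_res_alloc(dev);
}
```

      Callers always use the request function; only the mode flag decides
      whether the work is done locally or routed through the PF.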
    • Jack Morgenstein's avatar
      mlx4_core: cq modifications for SRIOV · d7233386
      Jack Morgenstein authored
      CQs are resources which are allocated and tracked by the PF driver.
      In multifunction mode, the allocation and icm mapping is done in
      the resource tracker (later patch in this sequence).
      
      To accomplish this, we have "work" functions whose names start with
      "__", and "request" functions (same name, no __). If we are operating
      in multifunction mode, the request function actually results in
      comm-channel commands being sent (ALLOC_RES or FREE_RES).
      The PF-driver comm-channel handler will ultimately invoke the
      "work" (__) function and return the result.
      
      If we are not in multifunction mode, the "work" handler is invoked
      immediately.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jack Morgenstein's avatar
      mlx4_core: qp modifications for SRIOV · fe9a2603
      Jack Morgenstein authored
      QPs are resources which are allocated and tracked by the PF driver.
      In multifunction mode, the allocation and icm mapping is done in
      the resource tracker (later patch in this sequence).
      
      To accomplish this, we have "work" functions whose names start with
      "__", and "request" functions (same name, no __). If we are operating
      in multifunction mode, the request function actually results in
      comm-channel commands being sent (ALLOC_RES or FREE_RES).
      The PF-driver comm-channel handler will ultimately invoke the
      "work" (__) function and return the result.
      
      If we are not in multifunction mode, the "work" handler is invoked
      immediately.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jack Morgenstein's avatar
      mlx4_core: srq modifications for SRIOV · 3ec65b2b
      Jack Morgenstein authored
      SRQs are resources which are allocated and tracked by the PF driver.
      In multifunction mode, the allocation and icm mapping is done in
      the resource tracker (later patch in this sequence).
      
      To accomplish this, we have "work" functions whose names start with
      "__", and "request" functions (same name, no __). If we are operating
      in multifunction mode, the request function actually results in
      comm-channel commands being sent (ALLOC_RES or FREE_RES).
      The PF-driver comm-channel handler will ultimately invoke the
      "work" (__) function and return the result.
      
      If we are not in multifunction mode, the "work" handler is invoked
      immediately.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Marcel Apfelbaum's avatar
      mlx4_core: Added FW commands and their wrappers for supporting SRIOV · 5cc914f1
      Marcel Apfelbaum authored
      The following commands are added here:
      1. QUERY_FUNC_CAP and its wrapper.  This function is used by VFs when
         they start up to receive configuration information from the PF, such
         as resource quotas for this VF, which ports should be used (currently
         two), what protocol is running on the port (currently Ethernet ONLY,
         or port not active).
      
      2. QUERY_PORT and its wrapper. Previously, this FW command was invoked directly
         by the ETH driver (en_port.c) using mlx4_cmd_box. Virtualization is now
         required here (the VF's MAC address must be substituted for the PF's
         MAC address returned by the FW). We changed the invocation
         in the ETH driver to use mlx4_QUERY_PORT, and added the wrapper.
      
      3. QUERY_HCA. Used by the VF to determine how the HCA was initialized.
         For now, we need only the multicast table member entry size
         (log2_mc_table_entry_sz, in the ConnectX PRM).  No wrapper is needed
         here, because the data may be passed as is to the VF without modification.

         In this command, we have added a GLOBAL_CAPS field for passing required
         configuration information from FW to a VF (this field is to allow safely
         adding new SRIOV capabilities which require support in VF drivers, too).
         Bits will be set here by FW in response to PF-driver configuration commands which
         will activate as yet undefined new SRIOV features. The VF will test to see that
         all required capabilities indicated by this field are supported (i.e., if a bit
         is set and the VF driver does not recognize that bit, it must abort
         its initialization).  Currently, no bits are set.
      
      4. Added a CLOSE_PORT wrapper.  The PF context needs to keep track of how many VF contexts
         have the port open.  The PF context will not actually issue the FW close port command
         until the last port user issues a CLOSE_PORT request.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
      Signed-off-by: Marcel Apfelbaum <marcela@mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
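
      The GLOBAL_CAPS rule in item 3 above (a VF must abort if FW sets a bit
      the VF driver does not recognize) reduces to a mask test. The sketch
      below is illustrative only: the function name, the 32-bit field width,
      and the return convention are assumptions, not the driver's interface.

```c
#include <stdint.h>

/* Hypothetical mask of GLOBAL_CAPS bits this VF driver understands.
 * The commit states that no bits are currently set, hence 0. */
#define DEMO_VF_KNOWN_GLOBAL_CAPS UINT32_C(0)

/* Returns 0 if the VF may continue, -1 if it must abort initialization
 * because FW reported a capability bit outside known_caps. */
static int demo_vf_check_global_caps(uint32_t fw_caps, uint32_t known_caps)
{
	uint32_t unknown = fw_caps & ~known_caps;

	return unknown ? -1 : 0;
}
```

      Because the check rejects any unknown bit, an older VF driver fails
      fast against a PF that has enabled a feature it cannot support.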
    • Yevgeny Petrilin's avatar
      net/mlx4_core: Implement the master-slave communication channel · e8f081aa
      Yevgeny Petrilin authored
      When SRIOV is enabled, the PF and VFs communicate via a shared comm channel.
      The VF gets its side of the comm channel via a VF BAR.
      Each VF (slave) creates its vHCR (virtual HCA Command Register).
      Its DMA address is passed to the PF (master) using the Communication Channel Register.
      The same Register is used to notify the master of commands posted by the
      slaves and for the master to pass events to the slaves, such as command completions
      and asynchronous events.
      
      The vHCR format is identical to the HCR format, except for the 'go' and 't' bits,
      which are reserved in the vHCR. Posting commands to the vHCR is identical to
      the way it is done with the HCR, albeit that the function/PF token fields are
      used instead of the HCR go bit.
      Specifically:
      - When the function prepares a new command in the vHCR, it issues the Post_vHCR_cmd
        communication channel command and toggles the value of the function token;
        when PF token has an equal value, the command has been accepted and a new command may be posted.
      - When the PF detects a Post_vHCR_cmd command, it concludes that a new command is available in the vHCR;
        after processing the command, the PF toggles the PF token to match the function token.
      
      When the 'e' bit is not set, the completion of a Post_vHCR_cmd command also indicates
      the completion of the vHCR command. If, however, the 'e' bit is set, the completion of a
      Post_vHCR_cmd command only indicates that the vHCR command has been accepted for execution by the PF.
      
      Function commands are processed by the PF as follows:
      -DMA (using the ACCESS_MEM command) the vHCR image into a shadow buffer.
      -Validate that the opcode is non-privileged, and that the opcode- and input-modifiers are legal.
      -DMA the in-box (if required) into a shadow buffer.
      -Validate the command:
      	o Resource ranges (e.g., QP ranges).
      	o Partition key.
      	o Ranges of referenced resources (e.g., CQs within QP contexts).
      -If the 'e' bit is set
      	o complete the Post_vHCR_cmd command
      -Execute the command on the HCR.
      -DMA the results to the vHCR out-box (if required).
      -If the 'e' bit is set
      	o Indicate command completion by generating a completion event using the GEN_EQE command
      -Otherwise
      	o DMA the command status to the vHCR
      	o Complete the Post_vHCR_cmd command
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Yevgeny Petrillin <yevgenyp@mellanox.com>
      Signed-off-by: Liran Liss <liranl@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
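
      The function/PF token handshake described in the comm channel commit
      above can be modelled with two one-bit values. The following C sketch
      is a simplification under stated assumptions: the struct and function
      names are invented, and the real channel is a hardware register rather
      than a plain struct in memory.

```c
#include <stdbool.h>

/* Toy model of the two one-bit tokens carried by the comm channel. */
struct demo_comm {
	unsigned int func_token;  /* toggled by the function (slave) */
	unsigned int pf_token;    /* toggled by the PF (master) */
};

/* Slave side: post a command by toggling the function token. */
static void demo_post_vhcr_cmd(struct demo_comm *c)
{
	c->func_token ^= 1u;
}

/* A command is outstanding while the two tokens differ. */
static bool demo_cmd_pending(const struct demo_comm *c)
{
	return c->func_token != c->pf_token;
}

/* PF side: after processing, toggle the PF token so that it matches. */
static void demo_pf_complete(struct demo_comm *c)
{
	if (demo_cmd_pending(c))
		c->pf_token ^= 1u;
}
```

      The slave may post a new command only once demo_cmd_pending() is false
      again, which plays the role the 'go' bit has on the regular HCR.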
    • Jack Morgenstein's avatar
      mlx4_core: Reduce number of PD bits to 17 · f5311ac1
      Jack Morgenstein authored
      When SRIOV is enabled on the chip (at FW burning time),
      the HCA uses only 17 bits for the PD. The remaining 7 high-order bits
      are ignored.
      
      Change the allocator to return only 17 bits for the PD.  The MSB 7
      bits will be used to encode the slave number for consistency
      checking later on in the resource tracker.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
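
      The 17 + 7 bit split described in the PD commit above can be written as
      a pair of pack/unpack helpers. This is a sketch only: the macro and
      function names are made up for the example, and the driver may arrange
      the encoding differently.

```c
#include <stdint.h>

#define DEMO_PD_BITS    17u
#define DEMO_PD_MASK    ((UINT32_C(1) << DEMO_PD_BITS) - 1)
#define DEMO_SLAVE_MASK UINT32_C(0x7f)  /* the 7 high-order bits */

/* Pack a slave number into the 7 bits above the 17-bit PD number. */
static uint32_t demo_pd_encode(uint32_t slave, uint32_t pdn)
{
	return ((slave & DEMO_SLAVE_MASK) << DEMO_PD_BITS) | (pdn & DEMO_PD_MASK);
}

/* Recover the slave number for a later consistency check. */
static uint32_t demo_pd_slave(uint32_t pd)
{
	return (pd >> DEMO_PD_BITS) & DEMO_SLAVE_MASK;
}

/* Recover the 17-bit PD number that the HCA actually uses. */
static uint32_t demo_pd_number(uint32_t pd)
{
	return pd & DEMO_PD_MASK;
}
```

      Since the HCA ignores the upper 7 bits, the slave number travels with
      the PD value without affecting how the hardware interprets it.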
    • Jack Morgenstein's avatar
      mlx4_core: Add "native" argument to mlx4_cmd and its callers (where needed) · f9baff50
      Jack Morgenstein authored
      For SRIOV, some Hypervisor commands can be executed directly (native = 1).
      Others should go through the command wrapper flow (for tracking resource
      usage, for example, or for changing some HCA configurations that slaves
      need to be notified of).
      
      This patch sets the groundwork for this capability -- adding the correct
      value of "native" in each case.
      
      Note that if SRIOV is not activated, this parameter has no effect.
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jack Morgenstein's avatar
      mlx4: Extending port_mask functionality · 65dab25d
      Jack Morgenstein authored
      The port mask now has an additional state.
      A port can be set as "none". In this case neither the mlx4_en nor the mlx4_ib
      driver takes ownership of the port.
      In multifunction mode there is an option to set the VFs as single-ported devices
      (in single-function mode, both physical ports belong to the same function).
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jack Morgenstein's avatar
      mlx4_core: initial header-file changes for SRIOV support · 623ed84b
      Jack Morgenstein authored
      These changes will not affect module operation as yet. They
      are only to get some structs and enums in place for use by
      subsequent patches (making those smaller).
      
      Added here:
      * sriov state structs and inlines (mlx4_is_master/slave/mfunc)
      * comm-channel and vhcr support structures
      * enum values for new FW and comm-channel virtual commands
        (i.e., commands, passed via the comm channel to the PF-driver).
      * prototypes for many command wrapper functions (used by the
        PF context for processing FW commands passed to it by the VFs).
      * struct mlx4_eqe is moved from eq.c to mlx4.h (it will be used
        by other mlx4_core source files).
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
    • Sathya Perla's avatar
      be2net: refactor/cleanup vf configuration code · 11ac75ed
      Sathya Perla authored
      - use adapter->num_vfs (and not the module param) to store the actual
      number of vfs created. Use the same variable to reflect SRIOV
      enable/disable state. So, drop the adapter->sriov_enabled field.
      
      - use for_all_vfs() macro in VF configuration code
      
      - drop the "vf_" prefix for the fields of be_vf_cfg; the prefix is
      redundant and removing it helps reduce line wrap
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Sathya Perla's avatar
      be2net: fix ethtool ringparam reporting · 110b82bc
      Sathya Perla authored
      The ethtool "-g" option is supposed to report the max queue length and
      user modified queue length for RX and TX queues.  be2net doesn't support
      user modification of queue lengths. So, the correct values for these
      would be the max numbers.
      be2net incorrectly reports the queue used values for these fields.
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Dmitry Kravkov's avatar
      bnx2x: properly update skb when mtu > 1500 · 036d2df9
      Dmitry Kravkov authored
      Since commit e52fcb24, newly allocated
      skbs for small packets are not updated properly and are dropped by the stack.
      Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
      Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Hagen Paul Pfeifer's avatar
      netem: add cell concept to simulate special MAC behavior · 90b41a1c
      Hagen Paul Pfeifer authored
      This extension can be used to simulate special link layer
      characteristics. Simulate because packet data is not modified, only the
      calculation base is changed to delay a packet based on the original
      packet size and artificial cell information.
      
      packet_overhead can be used to simulate a link layer header compression
      scheme (e.g. set packet_overhead to -20) or with a positive
      packet_overhead value an additional MAC header can be simulated. It is
      also possible to "replace" the 14 byte Ethernet header with something
      else.
      
      cell_size and cell_overhead can be used to simulate link layer schemes,
      based on cells, like some TDMA schemes. Another application area is MAC
      schemes using link layer fragmentation with a (small) header per fragment.
      Cell size is the maximum amount of data bytes within one cell. Cell
      overhead is an additional variable to change the per-cell-overhead
      (e.g.  5 byte header per fragment).
      
      Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
      cell overhead 5 byte):
      
        tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
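
      To make the packet_overhead / cell_size / cell_overhead arithmetic from
      the netem commit above concrete, here is a small userspace C sketch of
      how an effective length and a transmission time could be derived. The
      function names are hypothetical, the rate is assumed to be expressed in
      bytes per second (5 kbit/s = 625 bytes/s), and clamping a negative
      adjusted length to zero is a choice made for this example.

```c
#include <stdint.h>

/* Length used for the delay calculation; the packet data is untouched. */
static uint32_t demo_effective_len(uint32_t len, int32_t packet_overhead,
				   uint32_t cell_size, uint32_t cell_overhead)
{
	int64_t adj = (int64_t)len + packet_overhead;
	uint32_t out = adj > 0 ? (uint32_t)adj : 0;

	if (cell_size) {
		uint32_t cells = out / cell_size;

		if (out > cells * cell_size)  /* extra cell for the remainder */
			cells++;
		out = cells * (cell_size + cell_overhead);
	}
	return out;
}

/* Time on the wire in nanoseconds; rate_bytes_per_sec must be non-zero. */
static uint64_t demo_tx_time_ns(uint32_t eff_len, uint64_t rate_bytes_per_sec)
{
	return (uint64_t)eff_len * UINT64_C(1000000000) / rate_bytes_per_sec;
}
```

      For the example values above (20 byte packet overhead, 100 byte cells,
      5 byte cell overhead), a 1000 byte packet becomes 1020 bytes, needs 11
      cells, and is accounted as 11 * 105 = 1155 bytes.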
    • David S. Miller's avatar
    • Glauber Costa's avatar
      Display maximum tcp memory allocation in kmem cgroup · 0850f0f5
      Glauber Costa authored
      This patch introduces kmem.tcp.max_usage_in_bytes file, living in the
      kmem_cgroup filesystem. The root cgroup will display a value equal
      to RESOURCE_MAX. This is to avoid introducing any locking schemes in
      the network paths when cgroups are not being actively used.
      
      All others will see the maximum memory ever used by this cgroup.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Glauber Costa's avatar
      Display current tcp failcnt in kmem cgroup · ffea59e5
      Glauber Costa authored
      This patch introduces kmem.tcp.failcnt file, living in the
      kmem_cgroup filesystem. Following the pattern in the other
      memcg resources, this file keeps a counter of how many times
      allocation failed due to limits being hit in this cgroup.
      The root cgroup will always show a failcnt of 0.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
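
      The relationship between the usage, maximum usage, limit, and failcnt
      values exposed by the kmem.tcp files can be sketched as a simple
      counter. This is a toy model, not the kernel's implementation: the
      struct, the function names, and the lack of locking are all
      simplifications assumed for the example.

```c
#include <stdint.h>

struct demo_counter {
	uint64_t usage;      /* what usage_in_bytes would report */
	uint64_t max_usage;  /* what max_usage_in_bytes would report */
	uint64_t limit;      /* what limit_in_bytes would hold */
	uint64_t failcnt;    /* what failcnt would report */
};

/* Try to account for an allocation; count a failure if the limit is hit. */
static int demo_charge(struct demo_counter *c, uint64_t bytes)
{
	if (c->usage > c->limit || bytes > c->limit - c->usage) {
		c->failcnt++;
		return -1;
	}
	c->usage += bytes;
	if (c->usage > c->max_usage)
		c->max_usage = c->usage;
	return 0;
}

/* Release previously charged memory; max_usage is left untouched. */
static void demo_uncharge(struct demo_counter *c, uint64_t bytes)
{
	c->usage = bytes < c->usage ? c->usage - bytes : 0;
}
```

      Note that max_usage only ever grows, while failcnt counts rejected
      charges rather than bytes.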
    • Glauber Costa's avatar
      Display current tcp memory allocation in kmem cgroup · 5a6dd343
      Glauber Costa authored
      This patch introduces kmem.tcp.usage_in_bytes file, living in the
      kmem_cgroup filesystem. It is a simple read-only file that displays the
      amount of kernel memory currently consumed by the cgroup.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Glauber Costa's avatar
      tcp buffer limitation: per-cgroup limit · 3aaabe23
      Glauber Costa authored
      This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to
      effectively control the amount of kernel memory pinned by a cgroup.
      
      This value is ignored in the root cgroup, and in all others,
      caps the value specified by the admin in the net namespaces'
      view of tcp_sysctl_mem.
      
      If namespaces are being used, the admin is allowed to set a
      value bigger than cgroup's maximum, the same way it is allowed
      to set pretty much unlimited values in a real box.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Glauber Costa's avatar
      per-netns ipv4 sysctl_tcp_mem · 3dc43e3e
      Glauber Costa authored
      This patch allows each namespace to independently set up
      its levels for tcp memory pressure thresholds. This patch
      alone does not buy much: we need to make these values
      per group of processes somehow. This is achieved in the
      patches that follow in this patchset.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Glauber Costa's avatar
      tcp memory pressure controls · d1a4c0b3
      Glauber Costa authored
      This patch introduces memory pressure controls for the tcp
      protocol. It uses the generic socket memory pressure code
      introduced in earlier patches, and fills in the
      necessary data in cg_proto struct.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Glauber Costa's avatar
      socket: initial cgroup code. · e1aab161
      Glauber Costa authored
      The goal of this work is to move the memory pressure tcp
      controls to a cgroup, instead of just relying on global
      conditions.
      
      To avoid excessive overhead in the network fast paths,
      the code that accounts allocated memory to a cgroup is
      hidden inside a static_branch(). This branch is patched out
      until the first non-root cgroup is created. So when nobody
      is using cgroups, even if it is mounted, no significant performance
      penalty should be seen.
      
      This patch handles the generic part of the code, and has nothing
      tcp-specific.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtsu.com>
      CC: Kirill A. Shutemov <kirill@shutemov.name>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Glauber Costa's avatar
      foundations of per-cgroup memory pressure controlling. · 180d8cd9
      Glauber Costa authored
      This patch replaces all uses of struct sock fields' memory_pressure,
      memory_allocated, sockets_allocated, and sysctl_mem with accessor
      macros. Those macros can either receive a socket argument, or a mem_cgroup
      argument, depending on the context they live in.
      
      Since we're only doing a macro wrapping here, no performance impact at all is
      expected in the case where we don't have cgroups disabled.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Glauber Costa's avatar
      Basic kernel memory functionality for the Memory Controller · e5671dfa
      Glauber Costa authored
      This patch lays down the foundation for the kernel memory component
      of the Memory Controller.
      
      As of today, I am only laying down the following files:
      
       * memory.independent_kmem_limit
       * memory.kmem.limit_in_bytes (currently ignored)
       * memory.kmem.usage_in_bytes (always zero)
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      CC: Kirill A. Shutemov <kirill@shutemov.name>
      CC: Paul Menage <paul@paulmenage.org>
      CC: Greg Thelen <gthelen@google.com>
      CC: Johannes Weiner <jweiner@redhat.com>
      CC: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Laszlo Ersek's avatar
      xen-netfront: delay gARP until backend switches to Connected · 08e34eb1
      Laszlo Ersek authored
      After a guest is live migrated, the xen-netfront driver emits a gratuitous
      ARP message, so that networking hardware on the target host's subnet can
      take notice, and public routing to the guest is re-established. However,
      if the packet appears on the backend interface before the backend is added
      to the target host's bridge, the packet is lost, and the migrated guest's
      peers become unable to talk to the guest.
      
      A sufficient two-part condition to prevent the above is:

      (1) ensure that the backend only moves to Connected xenbus state after its
      hotplug scripts have completed, i.e. the netback interface got added to the
      bridge; and
      
      (2) ensure the frontend only queues the gARP when it sees the backend move
      to Connected.
      
      These two together provide complete ordering. Sub-condition (1) is already
      satisfied by commit f942dc25 in Linus' tree, based on commit
      6b0b80ca7165 from [1].
      
      In general, the full condition is sufficient, not necessary, because,
      according to [2], live migration has been working for a long time without
      satisfying sub-condition (2). However, after 6b0b80ca7165 was backported
      to the RHEL-5 host to ensure (1), (2) still proved necessary in the RHEL-6
      guest. This patch intends to provide (2) for upstream.
      
      The Reviewed-by line comes from [3].
      
      [1] git://xenbits.xen.org/people/ianc/linux-2.6.git#upstream/dom0/backend/netback-history
      [2] http://old-list-archives.xen.org/xen-devel/2011-06/msg01969.html
      [3] http://old-list-archives.xen.org/xen-devel/2011-07/msg00484.html
      Signed-off-by: Laszlo Ersek <lersek@redhat.com>
      Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 12 Dec, 2011 4 commits
  3. 11 Dec, 2011 1 commit