    powerpc/pseries: vio bus support for CMO · a90ab95a
    Robert Jennings authored
    This is a large patch but the normal code path is not affected.  For
    non-pSeries platforms the code is ifdef'ed out and for non-CMO enabled
    pSeries systems this does not affect the normal code path.  Devices that
    do not perform DMA operations do not need modification with this patch.
    The function get_desired_dma was renamed from get_io_entitlement for
    clarity.
    
    Overview
    
    Cooperative Memory Overcommitment (CMO) allows for a set of OS partitions
    to be run with less RAM than the aggregate needs of the group of
    partitions.  The firmware will balance memory between the partitions
    and page in/out memory as needed.  Based on the number and type of IO
    adapters present, each partition is allocated an amount of memory for
    DMA operations and this allocation will be guaranteed to the partition;
    this is referred to as the partition's 'entitlement'.
    
    Partitions running in a CMO environment can only have virtual IO devices
    present.  The VIO bus layer will manage the IO entitlement for the system.
    Accounting, at a system and per-device level, is tracked in the VIO bus
    code and exposed via sysfs.  A set of dma_ops functions is added to
    the bus to allow for this accounting.
    
    Bus initialization
    
    At initialization, the bus will calculate the minimum needs of the system
    based on providing each device present with a standard minimum entitlement
    along with a spare allocation for the bus to handle hotplug events.
    If the minimum needs cannot be met, the system boot will be halted.
    
    Device changes
    
    The significant changes for devices while running under CMO are that the
    devices must specify how much dedicated IO entitlement they desire and
    must also handle DMA mapping errors that can occur due to constrained
    IO memory.  The virtual IO drivers are modified to silence errors when
    DMA mappings fail for CMO and handle these failures gracefully.
    
    Each device will be guaranteed a minimum entitlement that can always
    be mapped.  Devices will specify how much entitlement they desire and
    the VIO bus will attempt to provide for this.  Devices can change their
    desired entitlement level at any point in time to address particular needs
    (via vio_cmo_set_dev_desired()), not just at device probe time.
    
    VIO bus changes
    
    The system will have a particular entitlement level available from which
    it can provide memory to the devices.  The bus defines two pools of memory
    within this entitlement, the reserve and excess pools.  Each device is
    provided with its own entitlement no less than a system defined minimum
    entitlement and no greater than what the device has specified as its
    desired entitlement.  The entitlement provided to devices comes from the
    reserve pool.  The reserve pool can also contain a spare allocation as
    large as the system defined minimum entitlement which is used for device
    hotplug events.  Any entitlement not needed to fulfill the needs of the
    reserve pool is placed in the excess pool.  Each device is guaranteed
    that it can map up to its entitled level; additional mappings are possible
    as long as there is unmapped memory in the excess pool.
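
    The mapping rule above can be sketched as a small user-space model.
    This is only an illustration under stated assumptions: the struct and
    function names (cmo_bus, cmo_dev, cmo_map, cmo_unmap) are hypothetical
    and are not the ones used in vio.c, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative types; field names are assumptions, not the vio.c layout. */
struct cmo_bus {
    size_t excess_size;  /* size of the excess pool */
    size_t excess_used;  /* bytes currently mapped out of the excess pool */
};

struct cmo_dev {
    size_t entitled;   /* guaranteed amount, backed by the reserve pool */
    size_t allocated;  /* bytes currently mapped by this device */
};

/* Account for a mapping of `size` bytes; returns false if it must fail. */
static bool cmo_map(struct cmo_bus *bus, struct cmo_dev *dev, size_t size)
{
    /* Part of the request still covered by the device's own entitlement. */
    size_t headroom = dev->entitled > dev->allocated
                      ? dev->entitled - dev->allocated : 0;
    /* Whatever is not covered has to come from the shared excess pool. */
    size_t from_excess = size > headroom ? size - headroom : 0;

    if (from_excess > bus->excess_size - bus->excess_used)
        return false;

    bus->excess_used += from_excess;
    dev->allocated += size;
    return true;
}

/* Release `size` bytes; usage above the entitlement is returned to the
 * excess pool before the device's entitled portion is freed. */
static void cmo_unmap(struct cmo_bus *bus, struct cmo_dev *dev, size_t size)
{
    assert(size <= dev->allocated);

    size_t over = dev->allocated > dev->entitled
                  ? dev->allocated - dev->entitled : 0;
    size_t to_excess = size < over ? size : over;

    bus->excess_used -= to_excess;
    dev->allocated -= size;
}
```

    In this model a request within the entitlement never fails, while a
    request beyond it succeeds only if the excess pool has enough unmapped
    memory left.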
    
    Bus probe
    
    As the system starts, each device is given an entitlement equal only
    to the system defined minimum entitlement.  The reserve pool is equal
    to the sum of these entitlements, plus a spare allocation.  The VIO bus
    also tracks the aggregate desired entitlement of all the devices.  If the
    system desired entitlement is greater than the size of the reserve pool,
    when devices unmap IO memory it will be reserved and a balance operation
    will be scheduled for some time in the future.
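
    As a rough sketch of the start-up arithmetic described above (again a
    user-space model, not the code in this patch): every device starts at
    the minimum, the reserve pool is the sum of those minimums plus the
    spare, and a balance is needed when the aggregate desired entitlement
    exceeds the reserve.  The names cmo_probe_state and cmo_probe are
    hypothetical, and treating a device's desired value as at least the
    minimum is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cmo_probe_state {
    size_t reserve;        /* sum of minimum entitlements plus the spare */
    size_t desired;        /* aggregate desired entitlement of all devices */
    bool   needs_balance;  /* devices want more than the reserve holds */
};

/* Returns 0 on success, -1 when the minimum needs of the system exceed
 * the system entitlement (the case in which boot would be halted). */
static int cmo_probe(struct cmo_probe_state *st, size_t system_entitled,
                     size_t min_entitled, size_t spare,
                     const size_t *dev_desired, size_t ndev)
{
    size_t reserve = spare;
    size_t desired = 0;

    for (size_t i = 0; i < ndev; i++) {
        reserve += min_entitled;
        /* Assumption: a device never counts for less than the minimum. */
        desired += dev_desired[i] > min_entitled
                   ? dev_desired[i] : min_entitled;
    }

    if (reserve > system_entitled)
        return -1;

    st->reserve = reserve;
    st->desired = desired;
    st->needs_balance = desired > reserve;
    return 0;
}
```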
    
    Entitlement balancing
    
    The balance function tries to fairly distribute entitlement between the
    devices in the system with the goal of providing each device with its
    desired amount of entitlement.  Devices using more than what would be
    ideal will have their entitled set-point adjusted; this will effectively
    set a goal for lower IO memory usage as future mappings can fail and
    deallocations will trigger a balance operation to distribute the newly
    unmapped memory.  A fair distribution of entitlement can take several
    balance operations to achieve.  Entitlement changes and device DLPAR
    events will alter the state of CMO and will trigger balance operations.
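
    One way to picture a fair distribution is the simplified model below.
    It is not the algorithm implemented in vio.c: it recomputes everything
    in a single call, ignores memory that is currently mapped, and uses
    hypothetical names (cmo_bal_dev, cmo_balance).  It only shows the
    idea of giving every device the minimum first and then sharing the
    remainder equally among devices still below their desired level.

```c
#include <assert.h>
#include <stddef.h>

struct cmo_bal_dev {
    size_t desired;   /* what the device asked for */
    size_t entitled;  /* what the balance pass assigns */
};

/* Distribute `avail` bytes among `ndev` devices.  Returns the leftover,
 * which in this model would belong to the excess pool. */
static size_t cmo_balance(struct cmo_bal_dev *devs, size_t ndev,
                          size_t avail, size_t min_entitled)
{
    /* Caller must guarantee the minimum needs can be met. */
    assert(avail >= ndev * min_entitled);

    for (size_t i = 0; i < ndev; i++) {
        devs[i].entitled = min_entitled;
        avail -= min_entitled;
    }

    for (;;) {
        size_t wanting = 0;

        for (size_t i = 0; i < ndev; i++)
            if (devs[i].entitled < devs[i].desired)
                wanting++;
        if (wanting == 0 || avail == 0)
            break;

        /* Equal share per unsatisfied device, at least one byte. */
        size_t chunk = avail / wanting;
        if (chunk == 0)
            chunk = 1;

        for (size_t i = 0; i < ndev && avail > 0; i++) {
            if (devs[i].entitled >= devs[i].desired)
                continue;
            size_t need = devs[i].desired - devs[i].entitled;
            size_t give = need < chunk ? need : chunk;
            if (give > avail)
                give = avail;
            devs[i].entitled += give;
            avail -= give;
        }
    }
    return avail;
}
```

    Each pass hands out at least one byte or finds no unsatisfied device,
    so the loop terminates.  The real bus spreads this work over several
    balance operations because mapped memory cannot be reclaimed at once.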
    
    Hotplug events
    
    The VIO bus allows for changes in system entitlement at run-time via
    'vio_cmo_entitlement_update()'.  When devices are added the hotplug
    device event will be preceded by a system entitlement increase and this
    is reversed when devices are removed.
    
    The following changes are made to the VIO bus layer for CMO:
     * add IO memory accounting per device structure.
     * add IO memory entitlement query function to driver structure.
     * during vio bus probe, if CMO is enabled, check that driver has
       memory entitlement query function defined.  Fail if function not defined.
     * fail to register driver if io entitlement function not defined.
     * create set of dma_ops at vio level for CMO that will track allocations
       and return DMA failures once entitlement is reached.  Entitlement will
       be limited by overall system entitlement.  Devices will have a reserved
       quantity of memory that is guaranteed, the rest can be used as available.
     * expose entitlement, current allocation, desired allocation, and the
       allocation error counter for devices to the user through sysfs
     * provide mechanism for changing a device's desired entitlement at run time
       as an exported function and sysfs tunable
     * track any DMA failures for entitled IO memory for each vio device.
     * check entitlement against available system entitlement on device add
     * track entitlement metrics (high water mark, current usage)
     * provide function to reset high water mark
     * provide minimum and desired entitlement numbers at a bus level
     * provide drivers with a minimum guaranteed entitlement
     * balance available entitlement between devices to satisfy their needs
     * handle system entitlement changes and device hotplug
    
    Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
    Acked-by: Paul Mackerras <paulus@samba.org>
    Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>