• Roger Pau Monne's avatar
    xen/blkback: Persistent grant maps for xen blk drivers · 0a8704a5
    Roger Pau Monne authored
    This patch implements persistent grants for the xen-blk{front,back}
    mechanism. The effect of this change is to reduce the number of unmap
    operations performed, since they cause a (costly) TLB shootdown. This
    allows the I/O performance to scale better when a large number of VMs
    are performing I/O.
    
    Previously, the blkfront driver was supplied a bvec[] from the request
    queue. This was granted to dom0; dom0 performed the I/O and wrote
    directly into the grant-mapped memory and unmapped it; blkfront then
    removed foreign access for that grant. The cost of unmapping scales
    badly with the number of CPUs in Dom0. An experiment showed that when
    Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
    ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
    (at which point 650,000 IOPS are being performed in total). If more
    than 5 guests are used, the performance declines. By 10 guests, only
    400,000 IOPS are being performed.
    
    This patch improves performance by only unmapping when the connection
    between blkfront and back is broken.
    
    On startup blkfront notifies blkback that it is using persistent
    grants, and blkback will do the same. If blkback is not capable of
    persistent mapping, blkfront will still use the same grants, since it
    is compatible with the previous protocol, and simplifies the code
    complexity in blkfront.
    
    To perform a read, in persistent mode, blkfront uses a separate pool
    of pages that it maps to dom0. When a request comes in, blkfront
    transmutes the request so that blkback will write into one of these
    free pages. Blkback keeps note of which grefs it has already
    mapped. When a new ring request comes to blkback, it looks to see if
    it has already mapped that page. If so, it will not map it again. If
    the page hasn't been previously mapped, it is mapped now, and a record
    is kept of this mapping. Blkback proceeds as usual. When blkfront is
    notified that blkback has completed a request, it memcpy's from the
    shared memory, into the bvec supplied. A record that the {gref, page}
    tuple is mapped, and not inflight is kept.
    
    Writes are similar, except that the memcpy is peformed from the
    supplied bvecs, into the shared pages, before the request is put onto
    the ring.
    
    Blkback stores a mapping of grefs=>{page mapped to by gref} in
    a red-black tree. As the grefs are not known apriori, and provide no
    guarantees on their ordering, we have to perform a search
    through this tree to find the page, for every gref we receive. This
    operation takes O(log n) time in the worst case. In blkfront grants
    are stored using a single linked list.
    
    The maximum number of grants that blkback will persistenly map is
    currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
    prevent a malicios guest from attempting a DoS, by supplying fresh
    grefs, causing the Dom0 kernel to map excessively. If a guest
    is using persistent grants and exceeds the maximum number of grants to
    map persistenly the newly passed grefs will be mapped and unmaped.
    Using this approach, we can have requests that mix persistent and
    non-persistent grants, and we need to handle them correctly.
    This allows us to set the maximum number of persistent grants to a
    lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
    setting it will lead to unpredictable performance.
    
    In writing this patch, the question arrises as to if the additional
    cost of performing memcpys in the guest (to/from the pool of granted
    pages) outweigh the gains of not performing TLB shootdowns. The answer
    to that question is `no'. There appears to be very little, if any
    additional cost to the guest of using persistent grants. There is
    perhaps a small saving, from the reduced number of hypercalls
    performed in granting, and ending foreign access.
    Signed-off-by: default avatarOliver Chick <oliver.chick@citrix.com>
    Signed-off-by: default avatarRoger Pau Monne <roger.pau@citrix.com>
    Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    [v1: Fixed up the misuse of bool as int]
    0a8704a5
blkback.c 30.2 KB