    ore: RAID5 Write · 769ba8d9
    Boaz Harrosh authored
    This is finally the RAID5 Write support.
    
    The bigger part of this patch is not the XOR engine itself, but the
    read4write logic, which is a complete mini prepare_for_striping
    reading engine that can read scattered pages of a stripe into cache
    so they can be used for XOR calculation. This is needed when the
    write is not stripe aligned.
    
    The main algorithm behind the XOR engine is the 2 dimensional array:
    	struct __stripe_pages_2d.
    A drawing might save 1000 words
    ---
    
    __stripe_pages_2d
           |
     n = pages_in_stripe_unit;
     w = group_width - parity;
           |                            pages array presented to the XOR lib
           |                                                |
           V                                                |
     __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
           |                                                |
     __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
           |
    ...    |                         ...
           |
     __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                                   ^
                                   |
               data added columns first then row
    
    ---
    The pages are put on this array columns first, i.e.:
    	p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
    So we are doing a corner turn of the pages.
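
    The corner turn above can be sketched as follows. This is only an
    illustration, not the code in this patch: the struct layouts, the
    MAX_* limits and the name sp2d_add_page() are assumptions. It maps a
    page's linear index within the stripe to its (row, column) slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures in the drawing. */
#define MAX_COLS 8	/* data columns + parity (illustrative limit) */
#define MAX_ROWS 16	/* pages_in_stripe_unit (illustrative limit) */

struct one_page_stripe {
	void *pages[MAX_COLS];	/* [c0][c1]..[c_par] for one row */
};

struct stripe_pages_2d {
	unsigned pages_in_unit;	/* n: number of rows */
	unsigned data_devs;	/* w: number of data columns */
	struct one_page_stripe rows[MAX_ROWS];
};

/*
 * Store the page with linear index `pg` inside the stripe. Pages arrive
 * columns first: p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
 */
static void sp2d_add_page(struct stripe_pages_2d *sp2d, unsigned pg,
			  void *page)
{
	unsigned col = pg / sp2d->pages_in_unit;
	unsigned row = pg % sp2d->pages_in_unit;

	assert(col < sp2d->data_devs && row < MAX_ROWS);
	sp2d->rows[row].pages[col] = page;
}
```

    With n = 4, page index 5 lands in row 1 of column 1, so each row
    ends up holding one page from every column, ready to be XORed.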
    
    Note that pages will zigzag down and left, but are put sequentially
    in growing order. So when the time comes to XOR the stripe, only the
    beginning and end of the array need be checked. We scan the array
    and any NULL spot will be filled by pages-to-be-read.
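
    A rough sketch of the two per-row steps, under assumptions: the
    helper names are made up, and a plain byte-wise loop stands in for
    the kernel XOR library that the patch actually uses. First count the
    NULL slots that still need reading, then XOR the row into parity
    once every data page is present.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count NULL slots among the first `w` data columns of one row. */
static unsigned row_count_holes(void *const *pages, unsigned w)
{
	unsigned c, holes = 0;

	for (c = 0; c < w; c++)
		if (!pages[c])
			holes++;
	return holes;
}

/* XOR `w` data pages into `parity`. Every data page must be present. */
static void row_xor_parity(unsigned char *parity, void *const *pages,
			   unsigned w, size_t page_size)
{
	unsigned c;
	size_t i;

	memset(parity, 0, page_size);
	for (c = 0; c < w; c++) {
		const unsigned char *src = pages[c];

		assert(src != NULL);
		for (i = 0; i < page_size; i++)
			parity[i] ^= src[i];
	}
}
```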
    
    The FS that wants to support RAID5 needs to supply an
    operations-vector that searches a given page in cache, and specifies
    if the page is uptodate or needs reading. All these pages to be read
    are put on a slave ore_io_state and synchronously read. All the pages
    of a stripe are read in one IO, using the scatter gather mechanism.
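
    The shape of such an operations-vector might look like the sketch
    below. Everything here is hypothetical (struct r4w_ops, the
    get_page signature, r4w_collect() and the toy cache); see osd_ore.h
    for the real interface. It only shows the idea: ask the FS for each
    page and gather the ones that are not uptodate into one list to be
    read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operations-vector supplied by the FS. */
struct r4w_ops {
	/* Look a page up in cache and report whether it is uptodate. */
	void *(*get_page)(void *priv, uint64_t page_index, bool *uptodate);
};

/*
 * Gather the pages in [first, first + count) that need reading into
 * `to_read` (at most `max` entries). Returns how many were gathered.
 */
static size_t r4w_collect(const struct r4w_ops *ops, void *priv,
			  uint64_t first, size_t count,
			  void **to_read, size_t max)
{
	size_t i, n = 0;

	assert(ops && ops->get_page);
	for (i = 0; i < count; i++) {
		bool uptodate = false;
		void *page = ops->get_page(priv, first + i, &uptodate);

		if (page && !uptodate && n < max)
			to_read[n++] = page;
	}
	return n;
}

/* Toy 4-page "cache" standing in for a real FS page cache. */
struct toy_cache {
	unsigned char data[4][8];
	bool uptodate[4];
};

static void *toy_get_page(void *priv, uint64_t idx, bool *uptodate)
{
	struct toy_cache *c = priv;

	if (idx >= 4)
		return NULL;
	*uptodate = c->uptodate[idx];
	return c->data[idx];
}
```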
    
    In write we constrain our IO to be incomplete on at most a single
    stripe. Either the complete IO is within a single stripe, so we
    might have pages to read at both the beginning and the end of the
    stripe, or we have some reading to do at the beginning but end at a
    stripe boundary. The leftover pages are pushed to the next IO by the API
    already established by previous work, where an IO offset/length
    combination presented to the ORE might get the length truncated and
    the user must re-submit the leftover pages. (Both exofs and NFS
    support this)
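
    One possible reading of this constraint, as a sketch only (the
    helper name clamp_to_stripe() is made up and this is not the ORE
    code): an IO that fits in one stripe is kept whole, anything longer
    has its end rounded down to a stripe boundary, and the caller
    re-submits the rest.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the length the IO would be truncated to so that it is
 * incomplete on at most one stripe.
 */
static uint64_t clamp_to_stripe(uint64_t offset, uint64_t length,
				uint32_t stripe_size)
{
	uint64_t end = offset + length;
	uint64_t first_stripe_end;

	assert(stripe_size != 0);
	first_stripe_end = (offset / stripe_size + 1) * stripe_size;

	/* Whole IO inside one stripe: may need reads at both ends. */
	if (end <= first_stripe_end)
		return length;

	/* Otherwise stop at the last stripe boundary before `end`. */
	return (end - end % stripe_size) - offset;
}
```

    A caller would loop: submit, advance offset by the returned length,
    and submit again until nothing is left over.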
    
    But any ORE user should make its best effort to align its IO
    beforehand and avoid complications. A cached ore_layout->stripe_size
    member can be used for that calculation. (NOTE: ORE demands that
    stripe_size not be bigger than 32 bits)
    
    What else? Well read it and tell me.
    Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>