1. 14 Oct, 2011 7 commits
    • ore/exofs: Define new ore_verify_layout · 5a51c0c7
      Boaz Harrosh authored
      All users of the ore will need to check if current code
      supports the given layout. For example RAID5/6 is not
      currently supported.
      
      So move all the checks from exofs/super.c to a new
      ore_verify_layout() to be used by ore users.
      
      Note that any new layout must be passed through
      ore_verify_layout(), because it prepares and verifies some
      internal members of ore_layout, and the ore engine assumes
      it has been called.
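      The kind of check described above can be sketched in plain C.
      This is only an illustrative model, not the kernel code: the
      demo_layout struct, the DEMO_* constants, the 4096-byte page
      size and the exact set of checks are all assumptions made for
      the example. It rejects the unsupported RAID levels and fills
      in a derived member, which is why callers must run it first.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the kernel's struct ore_layout. */
enum demo_raid { DEMO_RAID_0 = 0, DEMO_RAID_5 = 5, DEMO_RAID_6 = 6 };

#define DEMO_PAGE_SIZE 4096u	/* assumed page size */

struct demo_layout {
	enum demo_raid raid_algorithm;
	uint32_t stripe_unit;	/* bytes per device per stripe */
	uint32_t mirror_p1;	/* number of mirrors plus one */
	uint32_t group_width;	/* devices per group, 0 = one big group */
	uint32_t group_count;	/* derived by demo_verify_layout() */
};

/* Returns 0 if the layout is usable, -EINVAL otherwise. */
int demo_verify_layout(uint32_t total_comps, struct demo_layout *l)
{
	/* Only plain striping is supported in this model. */
	if (l->raid_algorithm != DEMO_RAID_0)
		return -EINVAL;
	/* Assumed rule: stripe_unit is a non-zero multiple of a page. */
	if (l->stripe_unit == 0 || l->stripe_unit % DEMO_PAGE_SIZE)
		return -EINVAL;
	if (l->mirror_p1 == 0)
		return -EINVAL;

	if (l->group_width == 0) {
		/* No grouping requested: all devices form one group. */
		l->group_width = total_comps / l->mirror_p1;
		l->group_count = 1;
	} else {
		if (total_comps < l->group_width * l->mirror_p1)
			return -EINVAL;
		l->group_count = total_comps / l->mirror_p1 / l->group_width;
	}
	/* Too few components to form even one device per group. */
	if (l->group_width == 0)
		return -EINVAL;
	return 0;
}
```

      With 16 components, two copies of the data and a group_width
      of 4, this model derives a group_count of 2.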
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Support for partial component table · 3bd98568
      Boaz Harrosh authored
      Users like the objlayout-driver would like to only pass
      a partial device table that covers the IO in question.
      For example exofs divides the file into raid-group-sized
      chunks and only serves group_width number of devices at
      a time.
      
      The partiality is communicated by setting
      ore_components->first_dev, and the array covers all logical
      devices from oc->first_dev up to (oc->first_dev + oc->numdevs).
      
      The ore_comp_dev() API receives a logical device index
      and returns the actual present device in the table.
      An out-of-range dev_index will BUG.
      
      Logical device index is the theoretical device index as if
      all the devices of a file are present, i.e.:
      	total_devs = group_width * mirror_p1 * group_count
      	0 <= dev_index < total_devs
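      The lookup described above can be sketched as follows. The
      demo_* names and struct fields are hypothetical stand-ins
      modelled on the description, not the kernel definitions, and
      assert() plays the role of BUG for an out-of-range index. The
      sketch assumes the table holds numdevs entries, so the valid
      range is first_dev <= dev_index < first_dev + numdevs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ore structures. */
struct demo_dev {
	int id;
};

struct demo_components {
	unsigned first_dev;	/* logical index of devs[0] */
	unsigned numdevs;	/* number of entries present in devs[] */
	struct demo_dev **devs;	/* the partial table */
};

/* total_devs = group_width * mirror_p1 * group_count */
unsigned demo_total_devs(unsigned group_width, unsigned mirror_p1,
			 unsigned group_count)
{
	return group_width * mirror_p1 * group_count;
}

/* Map a logical device index to the device present in the table. */
struct demo_dev *demo_comp_dev(const struct demo_components *oc,
			       unsigned dev_index)
{
	/* An out-of-range index is a caller bug. */
	assert(dev_index >= oc->first_dev);
	assert(dev_index - oc->first_dev < oc->numdevs);
	return oc->devs[dev_index - oc->first_dev];
}
```

      For a table that starts at logical device 8 and holds 4
      entries, logical index 10 resolves to slot 2 of the array.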
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Support for short read/writes · bbf9a31b
      Boaz Harrosh authored
      Memory conditions and max_bio constraints might prevent us from
      servicing the full length of the requested IO. Instead of
      failing the complete IO we can issue a shorter read/write and
      report how much was actually executed in the ios->length
      member.
      
      All users must check ios->length at IO_done or upon return of
      ore_read/write and re-issue the remainder of the bytes, because
      unlike before, a short IO no longer returns an error.
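      A caller-side loop for this contract might look like the sketch
      below. It is a userspace model under stated assumptions: the
      demo_io_state struct, the demo_io_fn callback type and the toy
      backend that caps each call at 4096 bytes are all invented for
      illustration. Only the idea of reading back the executed length
      and re-issuing the remainder comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an IO state: length is in/out. */
struct demo_io_state {
	uint64_t offset;
	uint64_t length;	/* in: requested, out: actually executed */
};

typedef int (*demo_io_fn)(struct demo_io_state *ios);

/* Keep issuing IO until the whole range is covered or an error occurs. */
int demo_io_all(demo_io_fn do_io, uint64_t offset, uint64_t length,
		unsigned *rounds)
{
	while (length) {
		struct demo_io_state ios = { offset, length };
		int ret = do_io(&ios);

		if (ret)
			return ret;	/* a real error */
		/* Guard against a backend that makes no progress. */
		if (ios.length == 0 || ios.length > length)
			return -1;
		/* A short IO is not an error: advance and go again. */
		offset += ios.length;
		length -= ios.length;
		if (rounds)
			++*rounds;
	}
	return 0;
}

/* Toy backend that never moves more than 4096 bytes per call. */
int demo_short_io(struct demo_io_state *ios)
{
	if (ios->length > 4096)
		ios->length = 4096;
	return 0;
}
```

      A 10000-byte request against the toy backend completes in three
      rounds (4096 + 4096 + 1808 bytes).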
      
      This is part of the effort to support the pnfs-obj layout driver.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: Support for short read/writes · 154a9300
      Boaz Harrosh authored
      If at read/write_done the actual IO was shorter than requested,
      as reported in the returned ios->length, it is not an error. The
      remainder of the pages should just be unlocked but not marked
      uptodate or end_page_writeback. They will be re-issued later by
      the VFS.
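      The page handling described above can be modelled roughly as
      below. This is a toy illustration, not the exofs code: the page
      states, the 4096-byte page size and the rule that a page counts
      as done only when it lies entirely inside the executed length
      are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical page states for the model. */
enum demo_page_state {
	DEMO_LOCKED,		/* submitted, waiting for IO done */
	DEMO_DONE,		/* covered by the IO: marked complete */
	DEMO_UNLOCKED_ONLY	/* not covered: unlocked, left for a retry */
};

/* Finish nr_pages pages after an IO that executed done_bytes bytes.
 * Returns how many pages were fully covered. */
size_t demo_finish_pages(enum demo_page_state *pages, size_t nr_pages,
			 uint64_t done_bytes)
{
	size_t done = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		uint64_t page_end = (uint64_t)(i + 1) * DEMO_PAGE_SIZE;

		if (page_end <= done_bytes) {
			pages[i] = DEMO_DONE;
			done++;
		} else {
			/* Short IO: no error, just release the page. */
			pages[i] = DEMO_UNLOCKED_ONLY;
		}
	}
	return done;
}
```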
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Remove check for ios->kern_buff in _prepare_for_striping to later · 6851a5e5
      Boaz Harrosh authored
      Move the check and preparation of the ios->kern_buff case to
      later inside _write_mirror().
      
      Since read was never used with ios->kern_buff its support is removed
      instead of fixed.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: cleanup: Embed an ore_striping_info inside ore_io_state · 98260754
      Boaz Harrosh authored
      Now that each ore_io_state covers only a single raid group,
      only a single striping_info calculation is needed. Embed one
      inside ore_io_state to cache the calculation results and
      eliminate an extra call.
      
      Also the outer _prepare_for_striping is removed since it does nothing.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Only IO one group at a time (API change) · b916c5cd
      Boaz Harrosh authored
      Usually a single IO is confined to one group of devices
      (group_width) and at the boundary of a raid group it can
      spill into a second group. Current code would allocate a
      full device_table size array at each io_state so it can
      comply with requests that span two groups. Needless to say,
      that is very wasteful, especially when the device_table count
      can get very large (hundreds, even thousands), while a
      group_width is usually 8 or 10.
      
      * Change ore API to trim an IO that spans two raid groups.
        The user passes offset+length to ore_get_rw_state, and the
        ore might trim that length if it spans a group boundary.
        The user must check ios->length or ios->nrpages to see
        how much IO will be performed. It is the responsibility
        of the user to re-issue the remainder of the IO.
      
      * Modify exofs to copy spilled pages on to the next IO.
        This means one last kick is needed after all coalescing
        of pages is done.
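      The trimming rule in the first bullet can be sketched as simple
      arithmetic. This is a simplified model: group_bytes is a
      hypothetical parameter standing for the number of contiguous
      file bytes served by one group before IO moves on to the next
      one, and demo_trim_to_group is not an ore function.

```c
#include <assert.h>
#include <stdint.h>

/* Return how many of the requested bytes fit before the next group
 * boundary. group_bytes must be non-zero. */
uint64_t demo_trim_to_group(uint64_t offset, uint64_t length,
			    uint64_t group_bytes)
{
	uint64_t left_in_group;

	assert(group_bytes != 0);
	left_in_group = group_bytes - offset % group_bytes;
	return length < left_in_group ? length : left_in_group;
}

/* Count how many IOs a caller ends up issuing for one request when
 * it re-issues the remainder after every trim. */
unsigned demo_count_ios(uint64_t offset, uint64_t length,
			uint64_t group_bytes)
{
	unsigned ios = 0;

	while (length) {
		uint64_t now = demo_trim_to_group(offset, length, group_bytes);

		offset += now;
		length -= now;
		ios++;
	}
	return ios;
}
```

      With group_bytes of 1000, a request for 300 bytes at offset 900
      is trimmed to 100 bytes, and the caller issues a second IO for
      the remaining 200.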
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  2. 04 Oct, 2011 1 commit
    • ore/exofs: Change the type of the devices array (API change) · d866d875
      Boaz Harrosh authored
      In the pNFS obj-LD the device table at the layout level needs
      to point to a device_cache node, where it is possible and likely
      that many layouts will point to the same device-nodes.
      
      In Exofs we have a more orderly structure where we have a single
      array of devices that repeats twice for a round-robin view of the
      device table.
      
      This patch moves to a model that can be used by the pNFS obj-LD
      where struct ore_components holds an array of ore_dev-pointers.
      (ore_dev is newly defined and contains a struct osd_dev *od
       member)
      
      Each pointer in the array of pointers will point to a bigger
      user-defined dev_struct, which can be accessed by use of the
      container_of macro.
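      The container_of pattern mentioned above can be illustrated in
      userspace C. The demo_* types and the demo_container_of macro
      are stand-ins written for this example (the kernel macro adds
      type checking), and the field names other than od are
      assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for osd_dev and ore_dev. */
struct demo_osd_dev {
	int handle;
};

struct demo_ore_dev {
	struct demo_osd_dev *od;
};

/* A bigger, user-defined device struct that embeds the ore_dev. */
struct demo_user_dev {
	int did;			/* the user's private data */
	struct demo_ore_dev ored;	/* what the pointer array points at */
};

/* Simplified userspace version of the container_of idea. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the user's struct from a pointer to the embedded member. */
struct demo_user_dev *demo_to_user_dev(struct demo_ore_dev *ored)
{
	return demo_container_of(ored, struct demo_user_dev, ored);
}
```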
      
      In Exofs an __alloc_dev_table() function allocates the
      ore_dev-pointers array as well as an exofs_dev array in one
      allocation, and does the address arithmetic to set everything
      pointing correctly. It still keeps the double allocation trick
      for the inodes' round-robin view of the table.
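      The single-allocation idea can be sketched as below. This is an
      assumption-laden model rather than the exofs function: the
      demo_* types are invented, and the pointer array is sized at
      exactly twice numdevs here to show the repeated view, which may
      differ from the sizing the real code uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for ore_dev and exofs_dev. */
struct demo_ore_dev {
	void *od;
};

struct demo_exofs_dev {
	struct demo_ore_dev ored;
	unsigned did;
};

struct demo_table {
	unsigned numdevs;
	struct demo_ore_dev **ods;	/* 2 * numdevs pointers */
	struct demo_exofs_dev *devs;	/* numdevs device structs */
};

/* One allocation holds the pointer array followed by the devices. */
int demo_alloc_dev_table(struct demo_table *t, unsigned numdevs)
{
	size_t ptr_bytes = 2 * (size_t)numdevs * sizeof(struct demo_ore_dev *);
	size_t dev_bytes = (size_t)numdevs * sizeof(struct demo_exofs_dev);
	char *mem = calloc(1, ptr_bytes + dev_bytes);

	if (!mem)
		return -1;
	t->numdevs = numdevs;
	t->ods = (struct demo_ore_dev **)mem;
	t->devs = (struct demo_exofs_dev *)(mem + ptr_bytes);
	for (unsigned i = 0; i < numdevs; i++) {
		t->devs[i].did = i;
		/* The table repeats, so a window of numdevs pointers
		 * starting at any slot below numdevs wraps around. */
		t->ods[i] = &t->devs[i].ored;
		t->ods[i + numdevs] = &t->devs[i].ored;
	}
	return 0;
}

/* Freeing the pointer array frees everything. */
void demo_free_dev_table(struct demo_table *t)
{
	free(t->ods);
	t->ods = NULL;
	t->devs = NULL;
}
```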
      
      The device table is always allocated dynamically, even for the
      single device case, so it is unconditionally freed at umount.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  3. 03 Oct, 2011 5 commits
  4. 22 Sep, 2011 1 commit
  5. 12 Sep, 2011 9 commits
  6. 11 Sep, 2011 17 commits