1. 13 Feb, 2017 8 commits
    • MD: add doc for raid5-cache · 5a6265f9
      Shaohua Li authored
      I'm starting to document the raid5-cache feature. Please note this is
      kernel documentation rather than an mdadm manual, so I don't cover the
      details of how to use the feature on the mdadm side.
      
      Cc: NeilBrown <neilb@suse.com>
      Reviewed-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5a6265f9
    • 1601c590
    • md: ensure md devices are freed before module is unloaded. · 9356863c
      NeilBrown authored
      Commit cbd19983 ("md: Fix unfortunate interaction with evms")
      changed mddev_put() so that it would not destroy an md device while
      ->ctime was non-zero.
      
      Unfortunately, we didn't make sure to clear ->ctime when unloading
      the module, so it is possible for an md device to remain after
      module unload.  An attempt to open such a device will trigger
      an invalid memory reference in:
        get_gendisk -> kobj_lookup -> exact_lock -> get_disk
      
      when trying to access disk->fops, which was in the module that has
      been removed.
      
      So ensure we clear ->ctime in md_exit(), and explain how that is useful,
      as it isn't immediately obvious when looking at the code.
      
      Fixes: cbd19983 ("md: Fix unfortunate interaction with evms")
      Tested-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      9356863c
    • md/r5cache: improve journal device efficiency · 39b99586
      Song Liu authored
      It is important to be able to flush all stripes in raid5-cache.
      Therefore, we need to reserve some space on the journal device for
      these flushes. If the flush operation includes pending writes to the
      stripe, we need to reserve (conf->raid_disks + 1) pages per stripe
      for the flush out. This reduces the efficiency of the journal space.
      If we exclude these pending writes from the flush operation, we only
      need (conf->max_degraded + 1) pages per stripe.
      
      With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
      pending writes will be excluded from stripe flush out. Therefore,
      we can reduce reserved space for flush out and thus improve journal
      device efficiency.
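
The reservation arithmetic above can be sketched in user space as follows. This is only an illustration, not the kernel code: the helper name `r5c_reserved_pages` and its parameters are hypothetical, and only the two per-stripe formulas are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Journal pages to reserve so that `nr_stripes` cached stripes can
 * always be flushed out (hypothetical helper, for illustration only).
 *
 * raid_disks   - number of member disks in the array
 * max_degraded - number of parity disks (1 for RAID5, 2 for RAID6)
 * log_critical - true when pending writes are excluded from the flush
 */
static unsigned long r5c_reserved_pages(unsigned long nr_stripes,
					int raid_disks, int max_degraded,
					bool log_critical)
{
	int per_stripe;

	assert(max_degraded >= 1 && raid_disks > max_degraded);

	/* Formulas from the commit message, in pages per stripe. */
	per_stripe = log_critical ? max_degraded + 1 : raid_disks + 1;
	return nr_stripes * (unsigned long)per_stripe;
}
```

For an 8-disk RAID5 with 256 cached stripes, this gives 256 * 9 = 2304 pages when pending writes are included and 256 * 2 = 512 pages when they are excluded.
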
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      39b99586
    • md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Song Liu authored
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB pages,
      so this chunk contains 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, struct big_stripe_info tracks how many stripes
      of this big_stripe are in the write back cache. These counters are
      kept in a radix tree (big_stripe_tree). r5c_tree_index() is used to
      calculate keys for the radix tree.
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This lookup is protected by
      rcu_read_lock().
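
The big_stripe bookkeeping described above can be sketched in user space as follows. This is only a toy model under stated assumptions: all `toy_*` names are hypothetical, a fixed array stands in for the radix tree, the key is assumed to be the chunk index of the sector, and RCU protection is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_BIG_STRIPES 1024	/* toy capacity; the kernel uses a radix tree */

struct toy_cache {
	uint64_t chunk_sectors;			/* chunk size in 512-byte sectors */
	unsigned int cached[TOY_MAX_BIG_STRIPES]; /* cached stripes per big_stripe */
};

/* Key for a sector: index of the chunk (big_stripe) that contains it. */
static uint64_t toy_tree_index(const struct toy_cache *c, uint64_t sector)
{
	assert(c->chunk_sectors > 0);
	return sector / c->chunk_sectors;
}

/* A stripe at `sector` enters the write back cache. */
static void toy_stripe_add(struct toy_cache *c, uint64_t sector)
{
	uint64_t idx = toy_tree_index(c, sector);

	assert(idx < TOY_MAX_BIG_STRIPES);
	c->cached[idx]++;
}

/* A stripe at `sector` has been written out of the cache. */
static void toy_stripe_del(struct toy_cache *c, uint64_t sector)
{
	uint64_t idx = toy_tree_index(c, sector);

	assert(idx < TOY_MAX_BIG_STRIPES && c->cached[idx] > 0);
	c->cached[idx]--;
}

/*
 * True if any stripe of the chunk containing `sector` is cached, in
 * which case a chunk aligned read must not bypass the cache.
 */
static bool toy_big_stripe_cached(const struct toy_cache *c, uint64_t sector)
{
	uint64_t idx = toy_tree_index(c, sector);

	return idx < TOY_MAX_BIG_STRIPES && c->cached[idx] > 0;
}
```

With a 256kB chunk (512 sectors), a cached stripe at sector 520 marks the whole range 512..1023 as cached, so a single lookup decides whether a read of that chunk may bypass the cache.
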
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding a new flag, we reuse existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags is set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      03b047f4
    • EXPORT_SYMBOL radix_tree_replace_slot · 10257d71
      Song Liu authored
      It will be used in drivers/md/raid5-cache.c
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      10257d71
    • raid5: only dispatch IO from raid5d for harddisk raid · 765d704d
      Shaohua Li authored
      We made raid5 stripe handling multi-threaded before. It works well for
      SSDs. But for hard disks, multi-threading creates more disk seeks, so
      it does not always improve performance. For raid5 arrays built from
      several hard disks, multi-threading is still required, because raid5d
      becomes a bottleneck, especially for sequential writes.
      
      To overcome the disk seek issue, we only dispatch IO from raid5d if the
      array is harddisk based. Other threads can still handle stripes, but
      can't dispatch IO.
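
The dispatch rule described above can be modelled in user space as follows. This is a hedged sketch, not the patch itself: every `toy_*` name is hypothetical, IOs are represented by plain integers, and locking and thread wake-ups are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_IO 64

struct toy_array {
	bool rotational;		/* true if the array is hard-disk based */
	int pending[TOY_MAX_IO];	/* IOs deferred by worker threads */
	int nr_pending;
	int dispatched[TOY_MAX_IO];	/* IOs sent to the block layer, in order */
	int nr_dispatched;
};

/* Hand one IO to the (imaginary) block layer. */
static void toy_dispatch(struct toy_array *a, int io)
{
	assert(a->nr_dispatched < TOY_MAX_IO);
	a->dispatched[a->nr_dispatched++] = io;
}

/* Called by any stripe-handling thread once a stripe has IO ready. */
static void toy_issue_io(struct toy_array *a, int io, bool from_raid5d)
{
	if (a->rotational && !from_raid5d) {
		/* Worker thread on a hard-disk array: defer to raid5d. */
		assert(a->nr_pending < TOY_MAX_IO);
		a->pending[a->nr_pending++] = io;
		return;
	}
	toy_dispatch(a, io);
}

/* Called only by raid5d: dispatch everything the workers deferred. */
static void toy_flush_deferred(struct toy_array *a)
{
	for (int i = 0; i < a->nr_pending; i++)
		toy_dispatch(a, a->pending[i]);
	a->nr_pending = 0;
}
```

In this model a worker thread on a hard-disk array only queues its IO, and nothing reaches the block layer until raid5d flushes the queue. On a non-rotational array every thread dispatches directly.
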
      
      Ideally, we should control the IO dispatching order according to IO
      position internally. Right now we still depend on the block layer,
      which sometimes isn't very efficient.
      
      My setup has 9 hard disks, and each disk can do around 180M/s
      sequential write. So in theory, the raid5 can do 180 * 8 = 1440M/s
      sequential write. The test machine uses an ATOM CPU. I measured
      sequential write bandwidth to the raid array with a large iodepth:
      
      without patch: ~600M/s
      without patch and group_thread_cnt=4: 750M/s
      with patch and group_thread_cnt=4: 950M/s
      with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
      
      We are pretty close to the maximum bandwidth in the large iodepth
      case. The performance gap between software raid and the theoretical
      value is still very big for small iodepth sequential write though,
      because we don't have an efficient pipeline.
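
The theoretical figure and the gap to it can be checked with a few lines of arithmetic. The helper names below are hypothetical; the inputs (9 disks, 180M/s per disk, and the measured results) are the ones quoted above.

```c
#include <assert.h>

/*
 * Theoretical sequential write bandwidth of a RAID5 array: one disk's
 * worth of every stripe holds parity, so only (disks - 1) disks carry data.
 */
static double raid5_seq_write_bw(int disks, double per_disk_bw)
{
	assert(disks >= 3 && per_disk_bw > 0.0);
	return (disks - 1) * per_disk_bw;
}

/* Measured bandwidth as a percentage of the theoretical maximum. */
static double bw_efficiency_pct(double measured_bw, double theoretical_bw)
{
	assert(theoretical_bw > 0.0);
	return 100.0 * measured_bw / theoretical_bw;
}
```

With these numbers, `raid5_seq_write_bw(9, 180.0)` is 1440, and the best result of 1150M/s is roughly 80% of it, against roughly 42% for the unpatched 600M/s.
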
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      765d704d
    • md linear: fix a race between linear_add() and linear_congested() · 03a9e24e
      colyli@suse.de authored
      Recently I received a bug report that, on a Linux v3.0 based kernel, hot
      adding a disk to an md linear device causes a kernel crash in
      linear_congested(). From the crash image analysis, I find that in
      linear_congested(), mddev->raid_disks contains value N, but conf->disks[]
      only has N-1 pointers available. Then a NULL pointer dereference crashes
      the kernel.
      
      There is a race between linear_add() and linear_congested(), and the RCU
      code used in these two functions cannot avoid it. Since Linux v4.0 the
      RCU code has been replaced by mddev_suspend(). After checking the
      upstream code, it seems linear_congested() is not called in the
      generic_make_request() code path, so mddev_suspend() cannot prevent it
      from being called. The possible race still exists.
      
      Here I explain how the race still exists in the current code. On a machine
      with many CPUs, linear_add() is called on one CPU to add a hard disk to an
      md linear device; at the same time, on another CPU, linear_congested() is
      called to detect whether this md linear device is congested before issuing
      an I/O request to it.
      
      The following possible execution time sequence shows how the race can
      happen:
      
      seq    linear_add()                linear_congested()
       0                                 conf=mddev->private
       1   oldconf=mddev->private
       2   mddev->raid_disks++
       3                                 for (i=0; i<mddev->raid_disks;i++)
       4                                   bdev_get_queue(conf->disks[i].rdev->bdev)
       5   mddev->private=newconf
      
      In linear_add(), mddev->raid_disks is increased at time seq 2, and on
      another CPU, the for-loop in linear_congested() iterates conf->disks[i]
      using the increased mddev->raid_disks at time seq 3 and 4. But conf has
      not yet been replaced by the new one, which has one more element (a
      pointer to struct dev_info) in conf->disks[], so accessing the structure
      member at time seq 4 causes a NULL pointer dereference fault.
      
      To fix this race, the patch makes two modifications:
       1) Add 'int raid_disks' to struct linear_conf as a copy of
          mddev->raid_disks. It is initialized in linear_conf() and is always
          consistent with the number of pointers in 'struct dev_info disks[]'.
          When iterating conf->disks[] in linear_congested(), the for-loop uses
          conf->raid_disks instead of mddev->raid_disks, so the NULL pointer
          dereference can no longer happen.
       2) Bring the RCU protection back, and use kfree_rcu() in linear_add() to
          free the oldconf memory. Because oldconf may still be referenced as
          mddev->private in linear_congested(), kfree_rcu() makes sure that its
          memory is not released until no one uses it any more.
      Some code comments are also added in this patch to make the modification
      easier to understand.
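
The first part of the fix can be illustrated with a user-space toy model. This is a sketch under stated assumptions, not the kernel code: all `toy_*` names are hypothetical, the member array holds plain pointers, and the RCU calls that the kernel would use are only mentioned in comments.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_dev {
	int congested;
};

struct toy_linear_conf {
	int raid_disks;			/* always matches the length of disks[] */
	struct toy_dev *disks[];	/* one pointer per member device */
};

struct toy_mddev {
	int raid_disks;			/* may already be incremented by an add */
	struct toy_linear_conf *private;
};

/* Allocate a conf whose raid_disks agrees with its disks[] array. */
static struct toy_linear_conf *toy_conf_alloc(int raid_disks)
{
	struct toy_linear_conf *conf;

	assert(raid_disks > 0);
	conf = calloc(1, sizeof(*conf) + raid_disks * sizeof(conf->disks[0]));
	if (conf)
		conf->raid_disks = raid_disks;
	return conf;
}

static int toy_congested(const struct toy_mddev *mddev)
{
	/* The kernel would read this pointer under rcu_read_lock(). */
	const struct toy_linear_conf *conf = mddev->private;
	int ret = 0;

	/* Bound the loop by conf->raid_disks, never by mddev->raid_disks. */
	for (int i = 0; i < conf->raid_disks && !ret; i++)
		ret |= conf->disks[i]->congested;
	return ret;
}
```

Even if `mddev->raid_disks` has already been incremented (time seq 2) while `mddev->private` still points at the old conf, the loop never reads past the entries the old conf actually owns.
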
      
      This patch can be applied to kernels since v4.0, after commit
      3be260cc ("md/linear: remove rcu protections in favour of
      suspend/resume"). However, the bug was reported on a Linux v3.0 based
      kernel, so people who maintain kernels older than Linux v4.0 will need
      to backport this patch.
      
      Changelog:
       - v3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
             replace call_rcu() in linear_add().
       - v2: add RCU protection as suggested by Shaohua and Neil.
       - v1: initial effort.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Neil Brown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Shaohua Li <shli@fb.com>
      03a9e24e
  2. 12 Feb, 2017 1 commit
  3. 11 Feb, 2017 8 commits
  4. 10 Feb, 2017 23 commits