- 25 Jul, 2020 22 commits
-
-
Coly Li authored
The following three sysfs files are created to display according feature set information of bcache: /sys/fs/bcache/<cache set UUID>/internal/feature_compat /sys/fs/bcache/<cache set UUID>/internal/feature_ro_compat /sys/fs/bcache/<cache set UUID>/internal/feature_incompat is added by this patch, to display feature sets information of the cache set. Now only an incompat feature 'large_bucket' added in bcache, the sysfs file content is: [large_bucket] string large_bucket means the running bcache drive supports incompat feature 'large_bucket', the wrapping [] means the 'large_bucket' feature is currently enabled on this cache set. This patch is ready to display compat and ro_compat features, in future once bcache code implements such feature sets, the according feature strings will be displayed in their sysfs files too. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
The large bucket feature is to extend bucket_size from 16bit to 32bit. When create cache device on zoned device (e.g. zoned NVMe SSD), making a single bucket cover one or more zones of the zoned device is the simplest way to support zoned device as cache by bcache. But current maximum bucket size is 16MB and a typical zone size of zoned device is 256MB, this is the major motiviation to extend bucket size to a larger bit width. This patch is the basic and first change to support large bucket size, the major changes it makes are, - Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature, INCOMPAT means it introduces incompatible on-disk format change. - Add BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines. - Adds __le16 bucket_size_hi into struct cache_sb_disk at offset 0x8d0 for the on-disk super block format. - For the in-memory super block struct cache_sb, member bucket_size is extended from __u16 to __32. - Add get_bucket_size() to combine the bucket_size and bucket_size_hi from struct cache_sb_disk into an unsigned int value. Since we already have large bucket size helpers meta_bucket_pages(), meta_bucket_bytes() and alloc_meta_bucket_pages(), they make sure when bucket size > 8MB, the memory allocation for bcache meta data bucket won't fail no matter how large the bucket size extended. So these meta data buckets are handled properly when the bucket size width increase from 16bit to 32bit, we don't need to worry about them. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Currently the bcache internal btree node occupies a whole bucket. When loading the btree node from cache device into memory, mca_data_alloc() will call bch_btree_keys_alloc() to allocate memory for the whole bucket size, ilog2(b->c->btree_pages) is send to bch_btree_keys_alloc() as the parameter 'page_order'. c->btree_pages is set as bucket_pages() in bch_cache_set_alloc(), for bucket size > 8MB, ilog2(b->c->btree_pages) is 12 for 4KB page size. By default the maximum page order __get_free_pages() accepts is MAX_ORDER (11), in this condition bch_btree_keys_alloc() will always fail. Because of other over-page-order allocation failure fails the cache device registration, such btree node allocation failure wasn't observed during runtime. After other blocking page allocation failures for bucket size > 8MB, this btree node allocation issue may trigger potentical risk e.g. infinite dead-loop to retry btree node allocation after failure. This patch fixes the potential problem by setting c->btree_pages to meta_bucket_pages() in bch_cache_set_alloc(). In the condition that bucket size > 8MB, meta_bucket_pages() will always return a number which won't exceed the maximum page order of the buddy allocator. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In bch_btree_cache_alloc() when CONFIG_BCACHE_DEBUG is configured, allocate memory for c->verify_ondisk may fail if the bucket size > 8MB, which will require __get_free_pages() to allocate continuous pages with order > 11 (the default MAX_ORDER of Linux buddy allocator). Such over size allocation will fail, and cause 2 problems, - When CONFIG_BCACHE_DEBUG is configured, bch_btree_verify() does not work, because c->verify_ondisk is NULL and bch_btree_verify() returns immediately. - bch_btree_cache_alloc() will fail due to c->verify_ondisk allocation failed, then the whole cache device registration fails. And because of this failure, the first problem of bch_btree_verify() has no chance to be triggered. This patch fixes the above problem by two means, 1) If pages allocation of c->verify_ondisk fails, set it to NULL and returns bch_btree_cache_alloc() with -ENOMEM. 2) When calling __get_free_pages() to allocate c->verify_ondisk pages, use ilog2(meta_bucket_pages(&c->sb)) to make sure ilog2() will always generate a pages order <= MAX_ORDER (or CONFIG_FORCE_MAX_ZONEORDER). Then the buddy system won't directly reject the allocation request. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Similar to c->uuids, struct cache's prio_buckets and disk_buckets also have the potential memory allocation failure during cache registration if the bucket size > 8MB. ca->prio_buckets can be stored on cache device in multiple buckets, its in-memory space is allocated by kzalloc() interface but normally allocated by alloc_pages() because the size > KMALLOC_MAX_CACHE_SIZE. So allocation of ca->prio_buckets has the MAX_ORDER restriction too. If the bucket size > 8MB, by default the page allocator will fail because the page order > 11 (default MAX_ORDER value). ca->prio_buckets should also use meta_bucket_bytes(), meta_bucket_pages() to decide its memory size and use alloc_meta_bucket_pages() to allocate pages, to avoid the allocation failure during cache set registration when bucket size > 8MB. ca->disk_buckets is a single bucket size memory buffer, it is used to iterate each bucket of ca->prio_buckets, and compose the bio based on memory of ca->disk_buckets, then write ca->disk_buckets memory to cache disk one-by-one for each bucket of ca->prio_buckets. ca->disk_buckets should have in-memory size exact to the meta_bucket_pages(), this is the size that ca->prio_buckets will be stored into each on-disk bucket. This patch fixes the above issues and handle cache's prio_buckets and disk_buckets properly for bucket size larger than 8MB. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Bcache allocates a whole bucket to store c->uuids on cache device, and allocates continuous pages to store it in-memory. When the bucket size exceeds maximum allocable continuous pages, bch_cache_set_alloc() will fail and cache device registration will fail. This patch allocates c->uuids by alloc_meta_bucket_pages(), and uses ilog2(meta_bucket_pages(c)) to indicate order of c->uuids pages when free it. When writing c->uuids to cache device, its size is decided by meta_bucket_pages(c) * PAGE_SECTORS. Now c->uuids is properly handled for bucket size > 8MB. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Currently the in-memory meta data like c->uuids or c->disk_buckets are allocated by alloc_bucket_pages(). The macro alloc_bucket_pages() calls __get_free_pages() to allocated continuous pages with order indicated by ilog2(bucket_pages(c)), #define alloc_bucket_pages(gfp, c) \ ((void *) __get_free_pages(__GFP_ZERO|gfp, ilog2(bucket_pages(c)))) The maximum order is defined as MAX_ORDER, the default value is 11 (and can be overwritten by CONFIG_FORCE_MAX_ZONEORDER). In bcache code the maximum bucket size width is 16bits, this is restricted both by KEY_SIZE size and bucket_size size from struct cache_sb_disk. The maximum 16bits width and power-of-2 value is (1<<15) in unit of sector (512byte). It means the maximum value of bucket size in bytes is (1<<24) bytes a.k.a 4096 pages. When the bucket size is set to maximum permitted value, ilog2(4096) is 12, which exceeds the default maximum order __get_free_pages() can accepted, the failed pages allocation will fail cache set registration procedure and print a kernel oops message for the exceeded pages order. This patch introduces meta_bucket_pages(), meta_bucket_bytes(), and alloc_bucket_pages() helper routines. meta_bucket_pages() indicates the maximum pages can be allocated to meta data bucket, meta_bucket_bytes() indicates the according maximum bytes, and alloc_bucket_pages() does the pages allocation for meta bucket. Because meta_bucket_pages() chooses the smaller value among the bucket size and MAX_ORDER_NR_PAGES, it still works when MAX_ORDER overwritten by CONFIG_FORCE_MAX_ZONEORDER. Following patches will use these helper routines to decide maximum pages can be allocated for different meta data buckets. If the bucket size is larger than meta_bucket_bytes(), the bcache registration can continue to success, just the space more than meta_bucket_bytes() inside the bucket is wasted. Comparing bcache failed for large bucket size, wasting some space for meta data buckets is acceptable at this moment. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
We have struct cache_sb_disk for on-disk super block already, it is unnecessary to keep the in-memory super block format exactly mapping to the on-disk struct layout. This patch adds code comments to notice that struct cache_sb is not exactly mapping to cache_sb_disk, and removes the useless member csum and pad[5]. Although struct cache_sb does not belong to uapi, but there are still some on-disk format related macros reference it and it is unncessary to get rid of such dependency now. So struct cache_sb will continue to stay in include/uapi/linux/bache.h for now. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Setting sb->first_bucket and checking sb->keys indeed are only for cache device, it does not make sense to do them in read_super() for backing device too. This patch moves the related code piece into read_super_common() explicitly for cache device and avoid the confusion. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
The new added super block version BCACHE_SB_VERSION_BDEV_WITH_FEATURES (5) BCACHE_SB_VERSION_CDEV_WITH_FEATURES value (6), is for the feature set bits. Devices have super block version equal to the new version will have three new members for feature set bits in the on-disk super block, __le64 feature_compat; __le64 feature_incompat; __le64 feature_ro_compat; They are used for further new features which may introduce on-disk format change, and avoid unncessary super block version increase. The very basic features handling code skeleton is also initialized in this patch. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In register_cache_set(), c is pointer to struct cache_set, and ca is pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this registering cache has up to date version and other members, the in- memory version and other members should be updated to the newer value. But current implementation makes a cache set only has a single cache device, so the above assumption works well except for a special case. The execption is when a cache device new created and both ca->sb.seq and c->sb.seq are 0, because the super block is never flushed out yet. In the location for the following if() check, 2156 if (ca->sb.seq > c->sb.seq) { 2157 c->sb.version = ca->sb.version; 2158 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16); 2159 c->sb.flags = ca->sb.flags; 2160 c->sb.seq = ca->sb.seq; 2161 pr_debug("set version = %llu\n", c->sb.version); 2162 } c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0, the if() check will fail (because both values are 0), and the cache set version, set_uuid, flags and seq won't be updated. The above problem is hiden for current code, because the bucket size is compatible among different super block version. And the next time when running cache set again, ca->sb.seq will be larger than 0 and cache set super block version will be updated properly. But if the large bucket feature is enabled, sb->bucket_size is the low 16bits of the bucket size. For a power of 2 value, when the actual bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then read_super_common() will fail because the if() check to is_power_of_2(sb->bucket_size) is false. This is how the long time hidden bug is triggered. This patch modifies the if() check to the following way, 2156 if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) { Then cache set's version, set_uuid, flags and seq will always be updated corectly including for a new created cache device. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In bch_cache_set_alloc() there is a big if() checks combined by 11 items together. When this big if() statement fails, it is difficult to tell exactly which item fails indeed. This patch disassembles this big if() checks into 11 single if() checks, which makes code debug more easier. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
The improperly set bucket or block size will trigger error in read_super_common(). For large bucket size, a more accurate error message for invalid bucket or block size is necessary. This patch disassembles the combined if() checks into multiple single if() check, and provide more accurate error message for each check failure condition. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
Later patches will introduce feature set bits to on-disk super block and increase super block version. Current code in read_super() which reads common part of super block for version BCACHE_SB_VERSION_CDEV and version BCACHE_SB_VERSION_CDEV_WITH_UUID will be shared with the new version. Therefore this patch moves the reusable part into read_super_common(), this preparation patch will make later patches more simplier and only focus on new feature set bits. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
offset_to_stripe() returns the stripe number (in type unsigned int) from an offset (in type uint64_t) by the following calculation, do_div(offset, d->stripe_size); For large capacity backing device (e.g. 18TB) with small stripe size (e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual returned value which caller receives is 536870912, due to the overflow. Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are in range [0, bcache_dev->nr_stripes - 1]. This patch adds a upper limition check in offset_to_stripe(): the max valid stripe number should be less than bcache_device->nr_stripes. If the calculated stripe number from do_div() is equal to or larger than bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes is less than INT_MAX, exceeding upper limitation doesn't mean overflow, therefore -EOVERFLOW is not used as error code.) This patch also changes nr_stripes' type of struct bcache_device from 'unsigned int' to 'int', and return value type of offset_to_stripe() from 'unsigned int' to 'int', to match their exact data ranges. All locations where bcache_device->nr_stripes and offset_to_stripe() are referenced also get updated for the above type change. Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
For some block devices which large capacity (e.g. 8TB) but small io_opt size (e.g. 8 sectors), in bcache_device_init() the stripes number calcu- lated by, DIV_ROUND_UP_ULL(sectors, d->stripe_size); might be overflow to the unsigned int bcache_device->nr_stripes. This patch uses the uint64_t variable to store DIV_ROUND_UP_ULL() and after the value is checked to be available in unsigned int range, sets it to bache_device->nr_stripes. Then the overflow is avoided. Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Gustavo A. R. Silva authored
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Gustavo A. R. Silva authored
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Xu Wang authored
Remove unneeded variable i in bch_dirty_init_thread(). Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Xu Wang authored
Using for_each_clear_bit() to simplify the code. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
There are some meta data of bcache are allocated by multiple pages, and they are used as bio bv_page for I/Os to the cache device. for example cache_set->uuids, cache->disk_buckets, journal_write->data, bset_tree->data. For such meta data memory, all the allocated pages should be treated as a single memory block. Then the memory management and underlying I/O code can treat them more clearly. This patch adds __GFP_COMP flag to all the location allocating >0 order pages for the above mentioned meta data. Then their pages are treated as compound pages now. Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jean Delvare authored
registraion -> registration Fixes: 0c8d3fce ("bcache: configure the asynchronous registertion to be experimental") Signed-off-by: Jean Delvare <jdelvare@suse.de> Reviewed-by: Coly Li <colyli@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 Jul, 2020 1 commit
-
-
Jens Axboe authored
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.9/drivers Pull MD fix from Song. * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid5: use do_div() for 64 bit divisions in raid5_sync_request
-
- 23 Jul, 2020 1 commit
-
-
Yufen Yu authored
We get compilation error on 32-bit architectures (e.g. m68k), as: ERROR: modpost: "__udivdi3" [drivers/md/raid456.ko] undefined! Since 'sync_blocks' is defined as u64, use do_div() to fix this error. Fixes: c911c46c ("md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
- 22 Jul, 2020 8 commits
-
-
Jens Axboe authored
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.9/drivers Pull MD for 5.9 from Song. * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid10: avoid deadlock on recovery. raid: md_p.h: drop duplicated word in a comment md-cluster: fix rmmod issue when md_cluster convert bitmap to none md-cluster: fix safemode_delay value when converting to clustered bitmap md/raid5: support config stripe_size by sysfs entry md/raid5: set default stripe_size as 4096 md/raid456: convert macro STRIPE_* to RAID5_STRIPE_* raid5: remove the meaningless check in raid5_make_request raid5: put the comment of clear_batch_ready to the right place raid5: call clear_batch_ready before set STRIPE_ACTIVE md: raid10: Fix compilation warning md: raid5: Fix compilation warning md: raid5-cache: Remove set but unused variable md: Fix compilation warning
-
Vitaly Mayatskikh authored
When disk failure happens and the array has a spare drive, resync thread kicks in and starts to refill the spare. However it may get blocked by a retry thread that resubmits failed IO to a mirror and itself can get blocked on a barrier raised by the resync thread. Acked-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Vitaly Mayatskikh <vmayatskikh@digitalocean.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Randy Dunlap authored
Drop the doubled word "the" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <songliubraving@fb.com>
-
Zhao Heming authored
update_array_info misses calling module_put when removing cluster bitmap. steps to reproduce: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm: array /dev/md0 started. node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 1 dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b none node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 1 <== should be zero dlm 212992 9 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b clustered node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 2 <== increase dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b none node1 # mdadm -G /dev/md0 -b clustered node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 3 <== increase dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster ``` Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Zhao Heming authored
When array convert to clustered bitmap, the safe_mode_delay doesn't clean and vice versa. the /sys/block/mdX/md/safe_mode_delay keep original value after changing bitmap type. In safe_delay_store(), the code forbids setting mddev->safemode_delay when array is clustered. So in cluster-md env, the expected safemode_delay value should be 0. Reproducible steps: ``` node1 # mdadm --zero-superblock /dev/sd{b,c,d} node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc node1 # cat /sys/block/md0/md/safe_mode_delay 0.204 node1 # mdadm -G /dev/md0 -b none node1 # mdadm --grow /dev/md0 --bitmap=clustered node1 # cat /sys/block/md0/md/safe_mode_delay 0.204 <== doesn't change, should ZERO for cluster-md node1 # mdadm --zero-superblock /dev/sd{b,c,d} node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc node1 # cat /sys/block/md0/md/safe_mode_delay 0.000 node1 # mdadm -G /dev/md0 -b none node1 # cat /sys/block/md0/md/safe_mode_delay 0.000 <== doesn't change, should default value ``` Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Yufen Yu authored
Adding a new 'stripe_size' sysfs entry to set and show stripe_size. stripe_size should not be bigger than PAGE_SIZE, and it requires to be multiple of 4096. We can adjust stripe_size by writing value into sysfs entry, likely, set stripe_size as 16KB: echo 16384 > /sys/block/md1/md/stripe_size Show current stripe_size value: cat /sys/block/md1/md/stripe_size For PAGE_SIZE is equal to 4096, 'stripe_size' can just be read. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Yufen Yu authored
In RAID5, if issued bio size is bigger than stripe_size, it will be split in the unit of stripe_size and process them one by one. Even for size less then stripe_size, RAID5 also request data from disk at least of stripe_size. Nowdays, stripe_size is equal to the value of PAGE_SIZE. Since filesystem usually issue bio in the unit of 4KB, there is no problem for PAGE_SIZE as 4KB. But, for 64KB PAGE_SIZE, bio from filesystem requests 4KB data while RAID5 issue IO at least stripe_size (64KB) each time. That will waste resource of disk bandwidth and compute xor. To avoding the waste, we want to make stripe_size configurable. This patch just set default stripe_size as 4096. User can also set the value bigger than 4KB for some special requirements, such as we know the issued io size is more than 4KB. To evaluate the new feature, we create raid5 device '/dev/md5' with 4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE. 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default configure on /mnt directory. Then, trying to test it by dbench with command: dbench -D /mnt -t 1000 10. Result show as: 'stripe_size = 64KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9805011 0.021 64.728 Close 7202525 0.001 0.120 Rename 415213 0.051 44.681 Unlink 1980066 0.079 93.147 Deltree 240 1.793 6.516 Mkdir 120 0.004 0.007 Qpathinfo 8887512 0.007 37.114 Qfileinfo 1557262 0.001 0.030 Qfsinfo 1629582 0.012 0.152 Sfileinfo 798756 0.040 57.641 Find 3436004 0.019 57.782 WriteX 4887239 0.021 57.638 ReadX 15370483 0.005 37.818 LockX 31934 0.003 0.022 UnlockX 31933 0.001 0.021 Flush 687205 13.302 530.088 Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms ------------------------------------------------------- 'stripe_size = 4KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11999166 0.021 36.380 Close 8814128 0.001 0.122 Rename 508113 0.051 29.169 Unlink 2423242 0.070 38.141 Deltree 300 1.885 7.155 Mkdir 150 0.004 0.006 Qpathinfo 10875921 0.007 35.485 Qfileinfo 1905837 0.001 0.032 Qfsinfo 1994304 0.012 0.125 Sfileinfo 977450 0.029 26.489 Find 4204952 0.019 9.361 WriteX 5981890 0.019 27.804 ReadX 18809742 0.004 33.491 LockX 39074 0.003 0.025 UnlockX 39074 0.001 0.014 Flush 841022 10.712 458.848 Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms ------------------------------------------------------- It show that setting stripe_size as 4KB has higher thoughput, i.e. (376.777 vs 307.799) and has smaller latency than that setting as 64KB. 2) We try to evaluate IO throughput for /dev/md5 by fio with config: [4KB randwrite] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=4KB rw=randwrite [64KB write] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=1MB rw=write The result as follow: + + | stripe_size(64KB) | stripe_size(4KB) +----------------------------------------------------+ 4KB randwrite | 15MB/s | 100MB/s +----------------------------------------------------+ 1MB write | 1000MB/s | 700MB/s The result show that when size of io is bigger than 4KB (64KB), 64KB stripe_size has much higher IOPS. But for 4KB randwrite, that means, size of io issued to device are smaller, 4KB stripe_size have better performance. Normally, default value (4096) can get relatively good performance. But if each issued io is bigger than 4096, setting value more than 4096 may get better performance. Here, we just set default stripe_size as 4096, and we will try to support setting different stripe_size by sysfs interface in the following patch. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Yufen Yu authored
Convert macro STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT(). This patch is prepare for the following adjustable stripe_size. It will not change any existing functionality. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
- 16 Jul, 2020 7 commits
-
-
Guoqing Jiang authored
We can't guarntee the batched stripe to be set with STRIPE_HANDLE since there are lots of functions could set the flag, such as sync_request, ops_complete_* and end_{read,write}_request etc. Also clear_batch_ready called in handle_stripe ensures the batched list can't continue to be handled by handle_stripe. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Guoqing Jiang authored
To make people understand the function well, let's put the comment to the right place. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Guoqing Jiang authored
We tried to only put the head sh of batch list to handle_list, then the handle_stripe doesn't handle other members in the batch list. However, we still got the calltrace in break_stripe_batch_list. [593764.644269] stripe state: 2003 kernel: [593764.644299] ------------[ cut here ]------------ kernel: [593764.644308] WARNING: CPU: 12 PID: 856 at drivers/md/raid5.c:4625 break_stripe_batch_list+0x203/0x240 [raid456] [...] kernel: [593764.644363] Call Trace: kernel: [593764.644370] handle_stripe+0x907/0x20c0 [raid456] kernel: [593764.644376] ? __wake_up_common_lock+0x89/0xc0 kernel: [593764.644379] handle_active_stripes.isra.57+0x35f/0x570 [raid456] kernel: [593764.644382] ? raid5_wakeup_stripe_thread+0x96/0x1f0 [raid456] kernel: [593764.644385] raid5d+0x480/0x6a0 [raid456] kernel: [593764.644390] ? md_thread+0x11f/0x160 kernel: [593764.644392] md_thread+0x11f/0x160 kernel: [593764.644394] ? wait_woken+0x80/0x80 kernel: [593764.644396] kthread+0xfc/0x130 kernel: [593764.644398] ? find_pers+0x70/0x70 kernel: [593764.644399] ? kthread_create_on_node+0x70/0x70 kernel: [593764.644401] ret_from_fork+0x1f/0x30 As we can see, the stripe was set with STRIPE_ACTIVE and STRIPE_HANDLE, and only handle_stripe could set those flags then return. And since the stipe was already in the batch list, we need to return earlier before set the two flags. And after dig a little about git history especially commit 3664847d ("md/raid5: fix a race condition in stripe batch"), it seems the batched stipe still could be handled by handle_stipe, then handle_stipe needs to return earlier if clear_batch_ready to return true. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the if statement around the call to sysfs_link_rdev() in raid10_start_reshape() to avoid the compilation warning: warning: suggest braces around empty body in an ‘if’ statement when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the if statement around the calls to sysfs_link_rdev() to avoid the compilation warning "suggest braces around empty body in an ‘if’ statement" when compiling with W=1. Also fix function description comments to avoid kdoc format warnings. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the variable offset in r5c_tree_index() to avoid a "set but not used" compilation warning when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Damien Le Moal authored
Remove the if statement around the calls to sysfs_link_rdev() to avoid the compilation warnings: warning: suggest braces around empty body in an ‘if’ statement when compiling with W=1. For the call to sysfs_create_link() generating the same warning, use the err variable to store the function result, avoiding triggering another warning as the function is declared as 'warn_unused_result'. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
- 15 Jul, 2020 1 commit
-
-
Niklas Cassel authored
Add a new max_active zones definition in the sysfs documentation. This definition will be common for all devices utilizing the zoned block device support in the kernel. Export max_active_zones according to this new definition for NVMe Zoned Namespace devices, ZAC ATA devices (which are treated as SCSI devices by the kernel), and ZBC SCSI devices. Add the new max_active_zones member to struct request_queue, rather than as a queue limit, since this property cannot be split across stacking drivers. For SCSI devices, even though max active zones is not part of the ZBC/ZAC spec, export max_active_zones as 0, signifying "no limit". Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-