- 28 Mar, 2017 21 commits
-
-
Omar Sandoval authored
Compile-tested only. Cc: Tim Waugh <tim@cyberelk.net> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Compile-tested only. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
One hard problem adding .low limit is to detect idle cgroup. If one cgroup doesn't dispatch enough IO against its low limit, we must have a mechanism to determine if other cgroups dispatch more IO. We added the think time detection mechanism before, but it doesn't work for all workloads. Here we add a latency based approach. We already have mechanism to calculate latency threshold for each IO size. For every IO dispatched from a cgorup, we compare its latency against its threshold and record the info. If most IO latency is below threshold (in the code I use 75%), the cgroup could be treated idle and other cgroups can dispatch more IO. Currently this latency target check is only for SSD as we can't calcualte the latency target for hard disk. And this is only for cgroup leaf node so far. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
User configures latency target, but the latency threshold for each request size isn't fixed. For a SSD, the IO latency highly depends on request size. To calculate latency threshold, we sample some data, eg, average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency threshold of each request size will be the sample latency (I'll call it base latency) plus latency target. For example, the base latency for request size 4k is 80us and user configures latency target 60us. The 4k latency threshold will be 80 + 60 = 140us. To sample data, we calculate the order base 2 of rounded up IO sectors. If the IO size is bigger than 1M, it will be accounted as 1M. Since the calculation does round up, the base latency will be slightly smaller than actual value. Also if there isn't any IO dispatched for a specific IO size, we will use the base latency of smaller IO size for this IO size. But we shouldn't sample data at any time. The base latency is supposed to be latency where disk isn't congested, because we use latency threshold to schedule IOs between cgroups. If disk is congested, the latency is higher, using it for scheduling is meaningless. Hence we only do the sampling when block throttling is in the LOW limit, with assumption disk isn't congested in such state. If the assumption isn't true, eg, low limit is too high, calculated latency threshold will be higher. Hard disk is completely different. Latency depends on spindle seek instead of request size. Currently this feature is SSD only, we probably can use a fixed threshold like 4ms for hard disk though. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
Currently there is no way to know the request size when the request is finished. Next patch will need this info. We could add extra field to record the size, but blk_issue_stat has enough space to record it, so this patch just overloads blk_issue_stat. With this, we will have 49bits to track time, which still is very long time. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
Here we introduce per-cgroup latency target. The target determines how a cgroup can afford latency increasement. We will use the target latency to calculate a threshold and use it to schedule IO for cgroups. If a cgroup's bandwidth is below its low limit but its average latency is below the threshold, other cgroups can safely dispatch more IO even their bandwidth is higher than their low limits. On the other hand, if the first cgroup's latency is higher than the threshold, other cgroups are throttled to their low limits. So the target latency determines how we efficiently utilize free disk resource without sacifice of worload's IO latency. For example, assume 4k IO average latency is 50us when disk isn't congested. A cgroup sets the target latency to 30us. Then the cgroup can accept 50+30=80us IO latency. If the cgroupt's average IO latency is 90us and its bandwidth is below low limit, other cgroups are throttled to their low limit. If the cgroup's average IO latency is 60us, other cgroups are allowed to dispatch more IO. When other cgroups dispatch more IO, the first cgroup's IO latency will increase. If it increases to 81us, we then throttle other cgroups. User will configure the interface in this way: echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low latency is in microsecond unit By default, latency target is 0, which means to guarantee IO latency. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
Last patch introduces a way to detect idle cgroup. We use it to make upgrade/downgrade decision. And the new algorithm can detect completely idle cgroup too, so we can delete the corresponding code. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
Add interface to configure the threshold. The io.low interface will like: echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low idle is in microsecond unit. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
A cgroup gets assigned a low limit, but the cgroup could never dispatch enough IO to cross the low limit. In such case, the queue state machine will remain in LIMIT_LOW state and all other cgroups will be throttled according to low limit. This is unfair for other cgroups. We should treat the cgroup idle and upgrade the state machine to lower state. We also have a downgrade logic. If the state machine upgrades because of cgroup idle (real idle), the state machine will downgrade soon as the cgroup is below its low limit. This isn't what we want. A more complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But when queue gets upgraded to lower state, other cgroups could dispatch more IO and this cgroup can't dispatch enough IO, so the cgroup is below its low limit and looks like idle (fake idle). In this case, the queue should downgrade soon. The key to determine if we should do downgrade is to detect if cgroup is truely idle. Unfortunately it's very hard to determine if a cgroup is real idle. This patch uses the 'think time check' idea from CFQ for the purpose. Please note, the idea doesn't work for all workloads. For example, a workload with io depth 8 has disk utilization 100%, hence think time is 0, eg, not idle. But the workload can run higher bandwidth with io depth 16. Compared to io depth 16, the io depth 8 workload is idle. We use the idea to roughly determine if a cgroup is idle. We treat a cgroup idle if its think time is above a threshold (by default 1ms for SSD and 100ms for HD). The idea is think time above the threshold will start to harm performance. HD is much slower so a longer think time is ok. The patch (and the latter patches) uses 'unsigned long' to track time. We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses precision, should not a big deal. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
When cgroups all reach low limit, cgroups can dispatch more IO. This could make some cgroups dispatch more IO but others not, and even some cgroups could dispatch less IO than their low limit. For example, cg1 low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is 120M/s for the workload. Their bps could something like this: cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80 At T1, all cgroups reach low limit, so they can dispatch more IO later. Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO than its low limit 80M/s, we downgrade the queue from LIMIT_MAX to LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2 will have bandwidth below its low limit at most time. The big problem here is we don't know the maximum bandwidth of the workload, so we can't make smart decision to avoid the situation. This patch makes cgroup bandwidth change smooth. After disk upgrades from LIMIT_LOW to LIMIT_MAX, we don't allow cgroups use all bandwidth upto their max limit immediately. Their bandwidth limit will be increased gradually to avoid above situation. So above example will became something like: cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80 -> 45/75 -> 22/98 In this way cgroups bandwidth will be above their limit in majority time, this still doesn't fully utilize disk bandwidth, but that's something we pay for sharing. Scale up is linear. The limit scales up 1/2 .low limit every throtl_slice after upgrade. The scale up will stop if the adjusted limit hits .max limit. Scale down is exponential. We cut the scale value half if a cgroup doesn't hit its .low limit. If the scale becomes 0, we then fully downgrade the queue to LIMIT_LOW state. Note this doesn't completely avoid cgroup running under its low limit. The best way to guarantee cgroup doesn't run under its limit is to set max limit. For example, if we set cg1 max limit to 40, cg2 will never run under its low limit. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the cgroup is idle. When this happens, the cgroup doesn't hit its limit, so we can't move the state machine to higher level and all cgroups will be throttled to their lower limit, so we waste bandwidth. Detecting idle cgroup is hard. This patch handles a simple case, a cgroup doesn't dispatch any IO. We ignore such cgroup's limit, so other cgroups can use the bandwidth. Please note this will be replaced with a more sophisticated algorithm later, but this demonstrates the idea how we handle idle cgroups, so I leave it here. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
The throtl_slice is 100ms by default. This is a long time for SSD, a lot of IO can run. To make cgroups have smoother throughput, we choose a small value (20ms) for SSD. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
throtl_slice is important for blk-throttling. It's called slice internally but it really is a time window blk-throttling samples data. blk-throttling will make decision based on the samplings. An example is bandwidth measurement. A cgroup's bandwidth is measured in the time interval of throtl_slice. A small throtl_slice meanse cgroups have smoother throughput but burn more CPUs. It has 100ms default value, which is not appropriate for all disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes it tunable. Since throtl_slice isn't a time slice, the sysfs name 'throttle_sample_time' reflects its character better. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
cgroup could be throttled to a limit but when all cgroups cross high limit, queue enters a higher state and so the group should be throttled to a higher limit. It's possible the cgroup is sleeping because of throttle and other cgroups don't dispatch IO any more. In this case, nobody can trigger current downgrade/upgrade logic. To fix this issue, we could either set up a timer to wakeup the cgroup if other cgroups are idle or make sure this cgroup doesn't sleep too long. Setting up a timer means we must change the timer very frequently. This patch chooses the latter. Making cgroup sleep time not too big wouldn't change cgroup bps/iops, but could make it wakeup more frequently, which isn't a big issue because throtl_slice * 8 is already quite big. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
When queue state machine is in LIMIT_MAX state, but a cgroup is below its low limit for some time, the queue should be downgraded to lower state as one cgroup's low limit isn't met. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
When queue is in LIMIT_LOW state and all cgroups with low limit cross the bps/iops limitation, we will upgrade queue's state to LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request means the cgroup reaches the limit. If a cgroup has limit set for both read and write, we consider the combination of them for upgrade. The reason is read IO and write IO can interfere with each other. If we do the upgrade based in one direction IO, the other direction IO could be severly harmed. For a cgroup hierarchy, there are two cases. Children has lower low limit than parent. Parent's low limit is meaningless. If children's bps/iops cross low limit, we can upgrade queue state. The other case is children has higher low limit than parent. Children's low limit is meaningless. As long as parent's bps/iops (which is a sum of childrens bps/iops) cross low limit, we can upgrade queue state. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
each queue will have a state machine. Initially queue is in LIMIT_LOW state, which means all cgroups will be throttled according to their low limit. After all cgroups with low limit cross the limit, the queue state gets upgraded to LIMIT_MAX state. For max limit, cgroup will use the limit configured by user. For low limit, cgroup will use the minimal value between low limit and max limit configured by user. If the minimal value is 0, which means the cgroup doesn't configure low limit, we will use max limit to throttle the cgroup and the cgroup is ready to upgrade to LIMIT_MAX Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
Add low limit for cgroup and corresponding cgroup interface. To be consistent with memcg, we allow users configure .low limit higher than .max limit. But the internal logic always assumes .low limit is lower than .max limit. So we add extra bps/iops_conf fields in throtl_grp for userspace configuration. Old bps/iops fields in throtl_grp will be the actual limit we use for throttling. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
As discussed in LSF, add configure option for the interface and mark it as experimental, so people can try/test. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
We are going to support low/max limit, each cgroup will have 2 limits after that. This patch prepares for the multiple limits change. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
clean up the code to avoid using -1 Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 25 Mar, 2017 1 commit
-
-
Eric Biggers authored
blk_integrity_profile's are never modified, so mark them 'const' so that they are placed in .rodata and benefit from memory protection. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 24 Mar, 2017 4 commits
-
-
Eric Biggers authored
blkdev_issue_flush() is now always synchronous, and it no longer has a flags argument. So remove the part of the comment about the WAIT flag. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Eric Biggers authored
BLKDEV_IFL_* flags no longer exist; blkdev_issue_discard() now actually takes BLKDEV_DISCARD_* flags. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Geliang Tang authored
Use setup_timer() instead of init_timer() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Geliang Tang authored
Use setup_timer() instead of init_timer() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 23 Mar, 2017 14 commits
-
-
Dan Carpenter authored
There isn't a bug here, but Smatch is not smart enough to know that "nr_iovecs" can't be negative so it complains about underflows. Really, it's slightly cleaner to make this parameter unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Turn the different ways of merging or issuing I/O into a series of if/else statements instead of the current maze of gotos. Note that this means we pin the CPU a little longer for some cases as the CTX put is moved to common code at the end of the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Now that we have a nice direct issue heper this helps simplifying the code a bit, and also gets rid of the old_rq variable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a new wrapper that takes care of RCU / SRCU locking to avoid having boileplate code in the caller which would get duplicated with new callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
They are mostly the same code anyway - this just one small conditional for the plug case that is different for both variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
This flag was never used since it was introduced. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
When device open races with device shutdown, we can get the following oops in scsi_disk_get(): [11863.044351] general protection fault: 0000 [#1] SMP [11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop] [11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W 4.10.0-rc2-xen+ #35 [11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000 [11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70 [11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202 [11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000 [11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880 [11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001 [11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa [11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d [11863.059217] FS: 00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000 [11863.059217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0 [11863.059217] Call Trace: [11863.059217] ? disk_get_part+0x22/0x1f0 [11863.059217] sd_open+0x39/0x130 [11863.059217] __blkdev_get+0x69/0x430 [11863.059217] ? bd_acquire+0x7f/0xc0 [11863.059217] ? bd_acquire+0x96/0xc0 [11863.059217] ? blkdev_get+0x350/0x350 [11863.059217] blkdev_get+0x126/0x350 [11863.059217] ? _raw_spin_unlock+0x2b/0x40 [11863.059217] ? bd_acquire+0x7f/0xc0 [11863.059217] ? blkdev_get+0x350/0x350 [11863.059217] blkdev_open+0x65/0x80 ... As you can see RAX value is already poisoned showing that gendisk we got is already freed. The problem is that get_gendisk() looks up device number in ext_devt_idr and then does get_disk() which does kobject_get() on the disks kobject. However the disk gets removed from ext_devt_idr only in disk_release() (through blk_free_devt()) at which moment it has already 0 refcount and is already on its way to be freed. Indeed we've got a warning from kobject_get() about 0 refcount shortly before the oops. We fix the problem by using kobject_get_unless_zero() in get_disk() so that get_disk() cannot get reference on a disk that is already being freed. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
Make the function available for outside use and fortify it against NULL kobject. CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
When block device is closed, we call inode_detach_wb() in __blkdev_put() which sets inode->i_wb to NULL. That is contrary to expectations that inode->i_wb stays valid once set during the whole inode's lifetime and leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because inode_to_wb() returned NULL. The reason why we called inode_detach_wb() is not valid anymore though. BDI is guaranteed to stay along until we call bdi_put() from bdev_evict_inode() so we can postpone calling inode_detach_wb() to that moment. Also add a warning to catch if someone uses inode_detach_wb() in a dangerous way. Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called from bdi_unregister() which is not necessarily called from bdi_destroy() and thus the name is somewhat misleading. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
Currently we wait for all cgwbs to get released in cgwb_bdi_destroy() (called from bdi_unregister()). That is however unnecessary now when cgwb->bdi is a proper refcounted reference (thus bdi cannot get released before all cgwbs are released) and when cgwb_bdi_destroy() shuts down writeback directly. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy() which also means that writeback has been shutdown on them. Since this wait is going away, directly shutdown writeback on cgwbs from cgwb_bdi_destroy() to avoid live writeback structures after bdi_unregister() has finished. To make that safe with concurrent shutdown from cgwb_release_workfn(), we also have to make sure wb_shutdown() returns only after the bdi_writeback structure is really shutdown. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
Currently root wb_writeback structure is added to bdi->wb_list in bdi_init() and never removed. That is different from all other wb_writeback structures which get added to the list when created and removed from it before wb_shutdown(). So move list addition of root bdi_writeback to bdi_register() and list removal of all wb_writeback structures to wb_shutdown(). That way a wb_writeback structure is on bdi->wb_list if and only if it can handle writeback and it will make it easier for us to handle shutdown of all wb_writeback structures in bdi_unregister(). Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jan Kara authored
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback structures except for the one embedded inside struct backing_dev_info. That will allow us to simplify bdi unregistration. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-