Commits · 5757a6d76cdf6dda2a492c09b985c015e86779b1 · nexedi / linux

23 Jul, 2011 3 commits

Dan Williams authored Jul 23, 2011

Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) have power left
to spare.

Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.

echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

5757a6d7
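
The difference between the default per-group steering and the strict mode described in the commit above can be illustrated with a toy model. This is a minimal sketch, not kernel code: the function names, the representation of CPU groups as Python sets, and the rule that a completion stays local when the completing CPU already sits in the requester's group are assumptions made for illustration.

```python
def first_cpu_in_group(cpu, groups):
    """Return the lowest-numbered CPU of the group containing `cpu`.

    `groups` is a hypothetical stand-in for the kernel's group masks.
    """
    for group in groups:
        if cpu in group:
            return min(group)
    return cpu  # CPU not listed in any group: treat it as its own group


def completion_cpu(requester, current, groups, rq_affinity):
    """Pick the CPU that runs the completion (simplified model).

    rq_affinity == 0: no steering, complete where the interrupt landed.
    rq_affinity == 1: steer to the first CPU of the requester's group,
                      unless `current` is already in that group.
    rq_affinity == 2: strict mode, always steer to the requester itself.
    """
    if rq_affinity == 0:
        return current
    if rq_affinity == 2:
        return requester
    target = first_cpu_in_group(requester, groups)
    if first_cpu_in_group(current, groups) == target:
        return current
    return target
```

With two four-CPU groups, a request submitted on CPU 2 and completed by an interrupt on CPU 5 lands on CPU 0 in mode 1 (the first CPU of the group, which is the one the commit message says gets overloaded) and on CPU 2 in mode 2.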

backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu · ef323088

Mikulas Patocka authored Jul 23, 2011

synchronize_rcu sleeps for several timer ticks. synchronize_rcu_expedited is
much faster.

With a 100Hz timer frequency, removing 10000 block devices with the
"dmsetup remove_all" command takes 27 minutes. With this patch,
removing 10000 block devices takes only 15 seconds.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

ef323088
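
The figures quoted in the backing-dev commit above can be turned into per-device costs with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes the removal time is spread evenly over the devices, and the variable names are made up for illustration.

```python
HZ = 100                    # timer frequency from the commit message
devices = 10_000            # block devices removed
before_s = 27 * 60          # 27 minutes with synchronize_rcu
after_s = 15                # 15 seconds with synchronize_rcu_expedited

tick_ms = 1000 / HZ                                    # length of one timer tick
per_device_before_ms = before_s * 1000 / devices       # average cost per device, before
per_device_after_ms = after_s * 1000 / devices         # average cost per device, after
ticks_per_device_before = per_device_before_ms / tick_ms
speedup = before_s / after_s

print(f"before: {per_device_before_ms} ms/device "
      f"(~{ticks_per_device_before:.1f} ticks)")
print(f"after:  {per_device_after_ms} ms/device")
print(f"speedup: {speedup:.0f}x")
```

Under that assumption each removal costs about 162 ms before the patch, roughly 16 ticks at 100Hz, which matches the statement that synchronize_rcu sleeps for several timer ticks.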

block: fix patch import error in max_discard_sectors check · 4c64500e

Jens Axboe authored Jul 23, 2011

A '!' snuck in before the unlikely, rendering it useless.
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

4c64500e

13 Jul, 2011 1 commit

block: reorder request_queue to remove 64 bit alignment padding · d7b76301

Richard Kennedy authored Jul 13, 2011

Reorder request_queue to remove 16 bytes of alignment padding in 64 bit
builds.

On my config this shrinks the size of this structure from 1608 to 1592
bytes and therefore needs one fewer cachelines.

Also trivially move the open bracket { to be on the same line as the
structure name to make it easier to grep.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

d7b76301
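
The effect of field ordering on alignment padding, which the request_queue commit above relies on, can be reproduced in miniature with ctypes. This is a sketch under assumptions: the two structures and their field names are hypothetical and have nothing to do with the real request_queue layout, and the cacheline count assumes 64-byte cachelines, which the commit message does not state.

```python
import ctypes


class Padded(ctypes.Structure):
    # Small fields interleaved with 8-byte fields force padding after each
    # small field on ABIs that align 64-bit integers to 8 bytes.
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("ptr_a", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
        ("ptr_b", ctypes.c_uint64),
    ]


class Reordered(ctypes.Structure):
    # Same fields, large ones first: the small fields share one padded slot.
    _fields_ = [
        ("ptr_a", ctypes.c_uint64),
        ("ptr_b", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


def cachelines(size_bytes, line_bytes=64):
    """Number of cachelines needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // line_bytes)


print(ctypes.sizeof(Padded), ctypes.sizeof(Reordered))
print(cachelines(1608), cachelines(1592))
```

With 64-byte lines, 1608 bytes need 26 cachelines and 1592 bytes need 25, consistent with the "one fewer" claim in the commit message.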

12 Jul, 2011 4 commits

CFQ: add think time check for group · 7700fc4f

Shaohua Li authored Jul 12, 2011

Currently, when the last queue of a group has no requests, we don't expire
the queue, in the hope that a request from the group comes soon, so that the
group doesn't miss its share. But if the think time is large, that assumption
isn't correct and we just waste bandwidth. In such a case, we don't idle.

[global]
runtime=30
direct=1

[test1]
cgroup=test1
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file1
thinktime=9000

[test2]
cgroup=test2
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file2

	patched		base
test1	64k		39k
test2	548k		540k
total	604k		578k

group1 gets much better throughput because it waits less time.

To check whether the patch changes the behavior of a queue without think time,
I also tried giving test1 a 2ms think time or no think time. The test result is
stable. The throughput doesn't change with or without the patch.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

7700fc4f
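
The think time check described in the commit above boils down to: track how long a group typically waits between requests, and stop idling for it once that gap exceeds the idle window. The following is a minimal sketch of that idea, not the CFQ implementation. The class and function names, the smoothing weight, the minimum sample count, and the 8000 us idle window are all assumptions chosen for illustration; the 9000 us and 2000 us inputs mirror the think times mentioned in the commit message.

```python
class ThinkTime:
    """Running average of the gap between a completion and the next request."""

    def __init__(self, weight=0.125):
        self.weight = weight      # assumed smoothing factor
        self.mean_us = None
        self.samples = 0

    def record(self, gap_us):
        """Fold one observed think time (in microseconds) into the average."""
        self.samples += 1
        if self.mean_us is None:
            self.mean_us = float(gap_us)
        else:
            self.mean_us += self.weight * (gap_us - self.mean_us)


def should_idle(ttime, idle_window_us, min_samples=8):
    """Idle for the group only if it tends to come back within the window."""
    if ttime.samples < min_samples:
        return True               # too little history: keep the old behavior
    return ttime.mean_us <= idle_window_us


slow = ThinkTime()
for _ in range(10):
    slow.record(9000)             # like thinktime=9000 in the fio job

fast = ThinkTime()
for _ in range(10):
    fast.record(2000)             # like the 2ms control run

print(should_idle(slow, idle_window_us=8000))
print(should_idle(fast, idle_window_us=8000))
```

In this model the group with a 9000 us think time stops being idled on, while the 2000 us group is treated exactly as before, which is the behavior the benchmark in the commit message is probing.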

CFQ: add think time check for service tree · f5f2b6ce

Shaohua Li authored Jul 12, 2011

Currently, when the last queue of a service tree has no requests, we don't
expire the queue, in the hope that a request from the service tree comes soon,
so that the service tree doesn't miss its share. But if the think time is large,
that assumption isn't correct and we just waste bandwidth. In such a case, we
don't idle.

[global]
runtime=10
direct=1

[test1]
rw=randread
ioengine=libaio
size=500m
directory=/mnt
filename=file1
thinktime=9000

[test2]
rw=read
ioengine=libaio
size=1G
directory=/mnt
filename=file2

	patched		base
test1	41k/s		33k/s
test2	15868k/s	15789k/s
total	15902k/s	15817k/s

A slight improvement.

To check whether the patch changes the behavior of a queue without think time,
I also tried giving test1 a 2ms think time or no think time. The test has
variation even without the patch, but the average throughput doesn't change
with or without the patch.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

f5f2b6ce

CFQ: move think time check variables to a separate struct · 383cd721

Shaohua Li authored Jul 12, 2011

Move the variables used for the think time check to a separate struct. This is
to prepare for adding a think time check for service tree and group. No
functional change.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

383cd721

fixlet: Remove fs_excl from struct task. · 4aede84b

Justin TerAvest authored Jul 12, 2011

fs_excl is a poor man's priority inheritance for filesystems to hint to
the block layer that an operation is important. It was never clearly
specified, not widely adopted, and will not prevent starvation in many
cases (like across cgroups).

fs_excl was introduced with the time sliced CFQ IO scheduler, to
indicate when a process held FS exclusive resources and thus needed
a boost.

It doesn't cover all file systems, and it was never fully complete.
Let's kill it.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

4aede84b

10 Jul, 2011 1 commit

cfq: Remove special treatment for metadata rqs. · a07405b7

Justin TerAvest authored Jul 10, 2011

There is no consistency among filesystems as to which bios (or requests)
are marked as being metadata. It's interesting to expose this in traces,
but we shouldn't schedule the requests differently based on whether or
not they're marked as being metadata.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

a07405b7

08 Jul, 2011 2 commits

block: document blk_plug list access · 316cc67d

Shaohua Li authored Jul 08, 2011

I'm often confused about why preemption isn't disabled when changing the
blk_plug list. It would be better to add comments here in case others have
similar concerns.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

316cc67d

block: avoid building too big plug list · 55c022bb

Shaohua Li authored Jul 08, 2011

When I tested a fio script with a large I/O depth, I found that total throughput
drops compared to a relatively small I/O depth. The reason is that the thread
accumulates many requests in its plug list, which causes some delays (surely
this depends on CPU speed).
I think we'd better have a threshold for the number of requests. When the
threshold is reached, it means there is no request merging and queue lock
contention isn't severe when pushing per-task requests to the queue, so the
main advantages of blk plug don't exist. We can force a plug list flush in this
case.
With this, my test throughput actually increases and almost equals that of the
small I/O depth. Another side effect is that irq-off time decreases in
blk_flush_plug_list() for large I/O depths.
BLK_MAX_REQUEST_COUNT is chosen arbitrarily, but 16 is effective at reducing
lock contention for me. I'm open here, though; 32 is OK in my test too.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

55c022bb
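
The threshold idea from the plug list commit above can be sketched as a small model. This is an illustration, not the kernel's code: the PlugList class, its method names, and the choice to check the count before appending are assumptions; only the name BLK_MAX_REQUEST_COUNT and the value 16 come from the commit message.

```python
BLK_MAX_REQUEST_COUNT = 16   # value named in the commit message


class PlugList:
    """Toy per-task plug list that forces a flush once it grows too long."""

    def __init__(self, max_count=BLK_MAX_REQUEST_COUNT):
        self.max_count = max_count
        self.pending = []            # requests held back in the plug
        self.flushed_batches = []    # batches already pushed to the queue

    def flush(self):
        """Push all pending requests to the (imaginary) request queue."""
        if self.pending:
            self.flushed_batches.append(list(self.pending))
            self.pending.clear()

    def add(self, request):
        """Queue a request, flushing first if the plug is already full."""
        if len(self.pending) >= self.max_count:
            self.flush()
        self.pending.append(request)


plug = PlugList()
for i in range(40):
    plug.add(i)
print([len(batch) for batch in plug.flushed_batches], len(plug.pending))
```

Submitting 40 requests in this model produces two forced flushes of 16 requests each and leaves 8 requests pending, so no single flush has to process more than 16 requests.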

07 Jul, 2011 1 commit

compat_ioctl: fix make headers_check regression · 719c0c59

Johannes Stezenbach authored Jul 07, 2011

Fix headers_check error introduced by 390192b3:

include/linux/fd.h:6: included file 'linux/compat.h' is not exported
Signed-off-by: Johannes Stezenbach <js@sig21.net>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

719c0c59

06 Jul, 2011 1 commit

block: eliminate potential for infinite loop in blkdev_issue_discard · 0f799603

Mike Snitzer authored Jul 06, 2011

Due to the recently identified overflow in read_capacity_16() it was
possible for max_discard_sectors to be zero but still have discards
enabled on the associated device's queue.

Eliminate the possibility for blkdev_issue_discard to infinitely loop.

Interestingly this issue wasn't identified until a device, whose
discard_granularity was 0 due to read_capacity_16 overflow, was consumed
by blk_stack_limits() to construct limits for a higher-level DM
multipath device. The multipath device's resulting limits never had the
discard limits stacked because blk_stack_limits() will only do so if
the bottom device's discard_granularity != 0. This resulted in the
multipath device's limits.max_discard_sectors being 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

0f799603
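
Why a zero max_discard_sectors leads to an endless loop, as described in the commit above, is easy to see in a simplified model of splitting a discard range into chunks. This is a sketch under assumptions: split_discard is a hypothetical helper, not the kernel function, and raising ValueError merely stands in for whatever error the real fix returns.

```python
def split_discard(start, nr_sects, max_discard_sectors):
    """Split a discard of nr_sects sectors into chunks of at most
    max_discard_sectors each, returning (start, length) pairs."""
    if max_discard_sectors == 0:
        # Without this guard every chunk would have length 0, nr_sects
        # would never shrink, and the loop below would never terminate.
        raise ValueError("max_discard_sectors is 0, cannot issue discards")

    chunks = []
    while nr_sects:
        length = min(nr_sects, max_discard_sectors)
        chunks.append((start, length))
        start += length
        nr_sects -= length
    return chunks


print(split_discard(0, 10, 4))
```

The guard is the whole point of the example: the loop only makes progress when each chunk has a non-zero length.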

01 Jul, 2011 3 commits

compat_ioctl: fix warning caused by qemu · 390192b3

Johannes Stezenbach authored Jul 01, 2011

On Linux x86_64 host with 32bit userspace, running
qemu or even just "qemu-img create -f qcow2 some.img 1G"
causes a kernel warning:

ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img

ioctl 00005326 is CDROM_DRIVE_STATUS,
ioctl 801c0204 is FDGETPRM.

The warning appears because the Linux compat-ioctl handler for these
ioctls only applies to block devices, while qemu also uses the ioctls on
plain files.
Signed-off-by: Johannes Stezenbach <js@sig21.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

390192b3
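
The two command numbers in the warning quoted above can be decoded by hand. The sketch below assumes the generic Linux ioctl number layout used on x86 (8 bits of command number, 8 bits of type, 14 bits of size, 2 bits of direction); decode_ioctl is a hypothetical helper written for this example.

```python
def decode_ioctl(cmd):
    """Split an ioctl command number into its fields (generic layout)."""
    return {
        "nr": cmd & 0xFF,              # bits 0-7: command number
        "type": (cmd >> 8) & 0xFF,     # bits 8-15: type ("magic") byte
        "size": (cmd >> 16) & 0x3FFF,  # bits 16-29: argument size in bytes
        "dir": (cmd >> 30) & 0x3,      # bits 30-31: transfer direction
    }


for cmd in (0x00005326, 0x801C0204):
    fields = decode_ioctl(cmd)
    print(f"{cmd:08x}: type={fields['type']:#04x} nr={fields['nr']:#04x} "
          f"size={fields['size']} dir={fields['dir']}")
```

The decoded type and size fields line up with the `{t:'S';sz:0}` and `{t:02;sz:28}` annotations in the warning: 0x53 is ASCII 'S' with size 0, and 0x801c0204 carries type 0x02 with a 28-byte argument.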

block: flush MEDIA_CHANGE from drivers on close(2) · 85ef06d1

Tejun Heo authored Jul 01, 2011

Currently, only open(2) is defined as the 'clearing' point.  It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events.  Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.

The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage.  Among other
things, GET_EVENT can't detect media removal while TUR can.

This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers.  disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.

This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).

Note: Given the current usage, it seems @clearing hint is needlessly
complex.  disk_clear_events() can simply clear all events and the hint
can be boolean @flush.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

85ef06d1

Merge branch 'for-linus' into for-3.1/core · 04bf7869

Jens Axboe authored Jul 01, 2011

Conflicts:
	block/blk-throttle.c
	block/cfq-iosched.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

04bf7869

30 Jun, 2011 8 commits

Merge branch 'for-3.0-important' of git://git.drbd.org/linux-2.6-drbd into for-linus · 7b28afe0
Jens Axboe authored Jun 30, 2011

7b28afe0

drbd: we should write meta data updates with FLUSH FUA · 86e1e98e

Lars Ellenberg authored Jun 28, 2011

We used to write these with BIO_RW_BARRIER aka REQ_HARDBARRIER (unless
disabled in the configuration). The correct semantic now would be to
write with FLUSH/FUA.
For example, with activity log transactions, FUA alone is not enough, we
need the corresponding bitmap update (and all related application
updates) on stable storage as well.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

86e1e98e

drbd: fix limit define, we support 1 PiByte now · 15b493d1

Lars Ellenberg authored Jun 28, 2011

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

15b493d1

drbd: when receive times out on meta socket, also check last receive time on data socket · cb6518cb

Lars Ellenberg authored Jun 20, 2011

If we have an asymmetrically congested network, we may send P_PING,
but due to congestion, the corresponding P_PING_ACK would time out,
and we would drop a (congested, but otherwise) healthy connection
("PingAck did not arrive in time.")
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

cb6518cb

drbd: account bitmap IO during resync as resync-(related-)-io · 5a8b4242

Lars Ellenberg authored Jun 14, 2011

If we have a good resync rate, we will frequently update the on-disk
bitmap, which, if not accounted for as resync io, may let an otherwise
idle device appear to be "busy", and cause us to throttle resync.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

5a8b4242

drbd: don't cond_resched_lock with IRQs disabled · 8ccee20e

Lars Ellenberg authored Jun 06, 2011

The last commit, drbd: add missing spinlock to bitmap receive,
introduced a cond_resched_lock(), where the lock in question is taken
with irqs disabled.

As we must not schedule with IRQs disabled,
and cond_resched_lock_irq() does not exist yet,
we re-acquire the spin_lock_irq() for each bitmap page processed in turn.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

8ccee20e

drbd: add missing spinlock to bitmap receive · 829c6087

Lars Ellenberg authored Jun 03, 2011

During bitmap exchange, when using the RLE bitmap compression scheme,
we have a code path that can set the whole bitmap at once.

To avoid holding spin_lock_irq() for too long, we used to lock out other
bitmap modifications during bitmap exchange by other means, and then,
knowing we have exclusive access to the bitmap, modify it without
the spinlock, and with IRQs enabled.

Since we now allow local IO to continue, potentially setting additional
bits during the bitmap receive phase, this is no longer true, and we get
uncoordinated updates of bitmap members, causing bm_set to no longer
accurately reflect the total number of set bits.

To actually see this, you'd need to have a large bitmap, use RLE bitmap
compression, and have busy IO during sync handshake and bitmap exchange.

Fix this by taking the spin_lock_irq() in this code path as well, but
calling cond_resched_lock() after each page worth of bits processed.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

829c6087

drbd: Use the correct max_bio_size when creating resync requests · 0cfdd247

Philipp Reisner authored May 25, 2011

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>

0cfdd247

27 Jun, 2011 4 commits

cfq-iosched: make code consistent · 726e99ab

Shaohua Li authored Jun 27, 2011

ioc->ioc_data is RCU protected, so use the correct API to access it.
This doesn't change any behavior; it just makes the code consistent.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

726e99ab

cfq-iosched: fix a rcu warning · 3181faa8

Shaohua Li authored Jun 27, 2011

I got an RCU warning at boot: ioc->ioc_data is rcu_dereferenced
without holding rcu_read_lock.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

3181faa8

Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 · 258e43fd

Linus Torvalds authored Jun 26, 2011

* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: mark CONFIG_CIFS_NFSD_EXPORT as BROKEN
  cifs: free blkcipher in smbhash

258e43fd

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 · 804a007f

Linus Torvalds authored Jun 26, 2011

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  cifs: propagate errors from cifs_get_root() to mount(2)
  cifs: tidy cifs_do_mount() up a bit
  cifs: more breakage on mount failures
  cifs: close sget() races
  cifs: pull freeing mountdata/dropping nls/freeing cifs_sb into cifs_umount()
  cifs: move cifs_umount() call into ->kill_sb()
  cifs: pull cifs_mount() call up
  sanitize cifs_umount() prototype
  cifs: initialize ->tlink_tree in cifs_setup_cifs_sb()
  cifs: allocate mountdata earlier
  cifs: leak on mount if we share superblock
  cifs: don't pass superblock to cifs_mount()
  cifs: don't leak nls on mount failure
  cifs: double free on mount failure
  take bdi setup/destruction into cifs_mount/cifs_umount
Acked-by: Steve French <smfrench@gmail.com>

804a007f

25 Jun, 2011 5 commits

Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 · daf6c450
Steve French authored Jun 25, 2011

daf6c450

Merge branch 'timer-fixes-for-linus' of... · 8abf5588

Linus Torvalds authored Jun 25, 2011

Merge branch 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rtc: vt8500: Fix build error & cleanup rtc_class_ops->update_irq_enable()
  alarmtimers: Return -ENOTSUPP if no RTC device is present
  alarmtimers: Handle late rtc module loading

8abf5588

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6 · 4d362ad2

Linus Torvalds authored Jun 25, 2011

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: Remove unneeded version.h includes from sound/
  ASoC: pxa-ssp: Correct check for stream presence
  ASoC: imx: add missing module informations
  ASoC: imx: Remove unused Kconfig SND_MXC_SOC_SSI entry
  ALSA: HDA: Pinfix quirk for HP Z200 Workstation
  ALSA: VIA HDA: Create a master amplifier control for VT1718S.
  ALSA: VIA HDA: Mute/unmute mixer conncted to Headphone for VT1718S.
  ALSA: VIA HDA: Modify initial verbs list for VT1718S.
  ALSA: hda - Remove ALC268 model override for CPR2000
  ALSA: HDA: Remove quirk for an HP device
  ASoC: Remove unused and about to be broken SND_SOC_CUSTOM I/O bus

4d362ad2

Merge branch 'fortglx/3.0/tip/timers/rtc' of... · b1eb085c

Thomas Gleixner authored Jun 25, 2011

Merge branch 'fortglx/3.0/tip/timers/rtc' of git://git.linaro.org/people/jstultz/linux into timers/urgent

  * rtc: vt8500: Fix build error & cleanup rtc_class_ops->update_irq_enable()

b1eb085c

Merge branch 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6 · 536142f9

Linus Torvalds authored Jun 24, 2011

* 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6:
  drm/i915: save/resume forcewake lock fixes
  Revert "drm/i915: Kill GTT mappings when moving from GTT domain"
  drm/i915: Apply HWSTAM workaround for BSD ring on SandyBridge
  drm/i915: Call intel_enable_plane from i9xx_crtc_mode_set (again)

536142f9

24 Jun, 2011 7 commits

cifs: propagate errors from cifs_get_root() to mount(2) · 9403c9c5

Al Viro authored Jun 17, 2011

... instead of just failing with -EINVAL
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9403c9c5

cifs: tidy cifs_do_mount() up a bit · 5c4f1ad7

Al Viro authored Jun 17, 2011

Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5c4f1ad7

cifs: more breakage on mount failures · fa18f1bd

Al Viro authored Jun 17, 2011

if cifs_get_root() fails, we end up with ->mount() returning NULL,
which is not what callers expect.  Moreover, in case of superblock
reuse we end up leaking a superblock reference...
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fa18f1bd

cifs: close sget() races · ee01a14d

Al Viro authored Jun 17, 2011

have ->s_fs_info set by the set() callback passed to sget()
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ee01a14d

cifs: pull freeing mountdata/dropping nls/freeing cifs_sb into cifs_umount() · d757d71b

Al Viro authored Jun 17, 2011

all callers of cifs_umount() proceed to do the same thing; pull it into
cifs_umount() itself.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d757d71b

cifs: move cifs_umount() call into ->kill_sb() · 98ab494d

Al Viro authored Jun 17, 2011

instead of calling it manually in case cifs_read_super() fails
to set ->s_root, just call it from ->kill_sb().  cifs_put_super()
is gone now *and* we have cifs_sb shutdown and destruction done
after the superblock is gone from ->s_instances.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

98ab494d

cifs: pull cifs_mount() call up · 97d1152a

Al Viro authored Jun 17, 2011

... to the point prior to sget().  Now we have cifs_sb set up early
enough.
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

97d1152a