- 10 Jul, 2014 29 commits
-
Monam Agarwal authored
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL). rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure; in the case of a NULL pointer, there is no structure to initialize, so rcu_assign_pointer(p, NULL) can safely be converted to RCU_INIT_POINTER(p, NULL). Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
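For illustration, a minimal sketch of the conversion described above (the struct and field names are made up, not DRBD code):

```c
#include <linux/rcupdate.h>

/* Sketch only -- "foo" and "bar" are hypothetical names. */
struct bar;

struct foo {
	struct bar __rcu *ptr;
};

static void clear_ptr(struct foo *f)
{
	/* Before: rcu_assign_pointer(f->ptr, NULL);
	 * implies a publish barrier that is pointless for NULL. */

	/* After: no barrier needed, there is no structure to publish. */
	RCU_INIT_POINTER(f->ptr, NULL);
}
```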
-
Lars Ellenberg authored
If we already "pulled ahead", we can short-circuit, and avoid logging the same messages over and over again. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
If "dirty" blocks are written to during resync, that brings them in-sync. By explicitly requesting write-acks during resync even in protocol != C, we now can actually respect this. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Philipp Reisner authored
In setups involving a DRBD-proxy and connections that experience a lot of buffer-bloat, it might be necessary to set ping-timeout to an unusually high value. By default, DRBD waits for the same amount of time to decide whether a newly established TCP connection is stable. Since the DRBD-proxy is usually located in the same data center, such a long wait time may hinder DRBD's connect process. In such setups socket-check-timeout should be set to at least the round-trip time between DRBD and the DRBD-proxy, i.e. in most cases to 1. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
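As a hedged illustration only (resource name and values are made up; units and exact option spelling should be checked against drbd.conf(5)), such a proxy setup might look along these lines:

```
resource r0 {
    net {
        # generous value to cope with buffer-bloat towards the remote site
        ping-timeout          300;   # assumed to be in tenths of a second
        # but do not wait that long to declare a fresh local connection
        # stable; only the round trip to the local DRBD-proxy matters here
        socket-check-timeout    1;
    }
    # ...
}
```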
-
Philipp Reisner authored
Before the patch 'drbd: Keep the listening socket open while trying to connect to the peer', the newly created socket inherited the receive timeout from the listen socket. The listen socket had a receive timeout of connect-interval +/- 30% random jitter. The real issue is that after the mentioned patch we had no timeout at all. Now use 4 times the ping-timeout. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
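A rough sketch of the idea, not the actual patch (it assumes ping_timeo is kept in tenths of a second and that the timeout is applied to the new socket's receive path):

```c
/* Hypothetical sketch: give the freshly created connection a bounded
 * receive timeout derived from ping-timeout, instead of no timeout at all. */
static void set_initial_rcvtimeo(struct socket *sock, unsigned int ping_timeo)
{
	/* ping_timeo assumed to be in tenths of a second; sk_rcvtimeo is in jiffies */
	sock->sk->sk_rcvtimeo = ping_timeo * 4 * HZ / 10;
}
```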
-
Lars Ellenberg authored
Checksum-based resync trades CPU cycles for network bandwidth in situations where we expect many of the to-be-resynced blocks to actually be identical on both sides already. In a "network hiccup" scenario it won't help: all to-be-resynced blocks will typically be different. The use case is the resync of *potentially* different blocks after crash recovery -- the crash recovery had marked larger areas (those covered by the activity log) as need-to-be-resynced, just in case. Most of those blocks will be identical. This option makes it possible to configure checksum-based resync, but only actually use it for the first resync after a primary crash. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
During the handshake, we compare backend sizes and user-set limits, and agree on what device size we are going to expose. We remember that last-agreed size in our meta data. But if we come up diskless, we have to accept whatever the peer presents us with. We used to accept the peer's maximum potential capacity (backend size), which is wrong and could lead to IO errors due to access beyond the end of the device. Instead, we need to accept the peer's current size. Unless that is communicated as 0, in which case we accept the backend size, or the user-set limit, if set. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
We intentionally do not serialize /proc/drbd access with internal state changes or statistics updates. Because of that, cat /proc/drbd may race with a resync just being finished: it still sees the sync state and finds information about the number of blocks still to go, but then finds that the total number of blocks within this resync has just been reset to 0 when accessing it. This produces bogus numbers in the resync speed estimates. Fix by accessing all relevant data only once, and by fixing it up if "still to go" happens to be more than "total". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
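The fix described above amounts to snapshotting the counters once and clamping the result -- a sketch, not the literal /proc/drbd code (the "rs_left" field name is illustrative):

```c
/* Sketch: sample both values once; a resync may finish between the reads,
 * so "still to go" can momentarily exceed "total".  Clamp before computing
 * any speed estimate from the two numbers. */
static void get_resync_progress(struct drbd_device *device,
				unsigned long *total, unsigned long *left)
{
	*total = device->rs_total;
	*left  = device->rs_left;	/* illustrative field name */

	if (*left > *total)
		*left = *total;
}
```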
-
Andreas Gruenbacher authored
Get rid of dump_stack() debug statements. There is no point whatsoever in registering and unregistering a reboot notifier that doesn't do anything. The intention was to switch to an "emergency read-only" mode, so we won't have to resync the full activity log just because we had been Primary before the reboot. Once we have that implemented, we may re-introduce the reboot notifier. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
The textual representation of resync extents in /proc/drbd, presented with proc_details >= 3, was wrong: it used bit numbers as bit masks. It was not particularly useful either, and I doubt anyone has even tried to look at it in the last few years. Drop it. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
The last user was al_write_transaction, if called with "delegate", and the last user to call it with "delegate = true" was the receiver thread, which has no need to delegate but can call it itself. Finally drop the delegate parameter, drop the extra w_al_write_transaction callback, and drop drbd_queue_work_front. Do not (yet) change dequeue_work_item to dequeue_work_batch, though. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
This replaces the md_sync_work member of struct drbd_device by a new MD_SYNC "work bit" in device->flags. This replaces the resync_start_work member of struct drbd_device by a new RS_START "work bit" in device->flags. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
The recent fix to put_ldev() (correct ordering of access to local_cnt and state.disk; memory barrier in __drbd_set_state) guarantees that the cleanup happens exactly once. However it does not yet guarantee that the cleanup happens from worker context, the last put_ldev() may still happen from atomic context, which must not happen: blkdev_put() may sleep. Fix this by scheduling the cleanup to the worker instead, using a couple more bits in device->flags and a new helper, drbd_device_post_work(). Generalized the "resync progress" work to cover these new work bits. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
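One plausible shape for such a helper -- a sketch; the wake-up target and the exact flag bits used by DRBD are assumptions here, not taken from the patch:

```c
/* Sketch: record that the device needs attention from the worker and kick
 * it.  Work bits such as DESTROY_DISK / GO_DISKLESS would live in
 * device->flags; the wait queue used by the worker is assumed. */
static void drbd_device_post_work(struct drbd_device *device, int work_bit)
{
	if (!test_and_set_bit(work_bit, &device->flags))
		wake_up(&device->misc_wait);	/* assumed wake-up target */
}
```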
-
Lars Ellenberg authored
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: bd_release+0x21/0x70
Process drbd_w_t7146
Call Trace:
 close_bdev_exclusive
 drbd_free_ldev [drbd]
 drbd_ldev_destroy [drbd]
 w_after_state_ch [drbd]

Race probably went like this:
 state.disk = D_FAILED

... first one to hit zero during D_FAILED:
 put_ldev() /* ----------------> 0 */
   i = atomic_dec_return()
   if (i == 0)
     if (state.disk == D_FAILED)
       schedule_work(go_diskless)
                                       /* 1 <------ */ get_ldev_if_state()
 go_diskless()
   do_some_pre_cleanup()               corresponding put_ldev():
   force_state(D_DISKLESS) /* 0 <----- */ i = atomic_dec_return()
                                          if (i == 0)
     atomic_inc() /* ---------> 1 */
     state.disk = D_DISKLESS
     schedule_work(after_state_ch)     /* execution pre-empted by IRQ ? */

 after_state_ch()
   put_ldev()
     i = atomic_dec_return()  /* 0 */
     if (i == 0)
       if (state.disk == D_DISKLESS)   if (state.disk == D_DISKLESS)
         drbd_ldev_destroy()             drbd_ldev_destroy();

Trying to fix this by checking the disk state *before* the atomic_dec_return(), which implies memory barriers, and by inserting extra memory barriers around the state assignment in __drbd_set_state(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
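In code, the fix described above boils down to sampling the disk state before the atomic decrement -- a sketch, not the literal patch (the cleanup callees are placeholders):

```c
static inline void put_ldev(struct drbd_device *device)
{
	/* Sample the state *before* dropping the reference; the barriers
	 * implied by atomic_dec_return(), together with the extra barriers
	 * around the state assignment in __drbd_set_state(), keep the two
	 * observations consistent between the racing contexts. */
	enum drbd_disk_state disk_state = device->state.disk;
	int i = atomic_dec_return(&device->local_cnt);

	if (i == 0) {
		if (disk_state == D_DISKLESS)
			drbd_ldev_destroy(device);	/* placeholder cleanup */
		else if (disk_state == D_FAILED)
			drbd_go_diskless(device);	/* placeholder */
	}
}
```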
-
Lars Ellenberg authored
For some reason we have assumed NOIDLE was implied by one of the other flags we set. It is not (anymore?). Explicitly set REQ_NOIDLE for synchronous meta data updates, or we can seriously starve random writes when using CFQ. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
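The change amounts to something along these lines (a sketch; the exact flag set DRBD uses for its meta-data I/O is an assumption):

```c
/* Sketch: meta-data writes are synchronous anyway; also tell the I/O
 * scheduler not to idle on our queue (REQ_NOIDLE), so CFQ does not
 * starve random writers while waiting for more I/O from us. */
int rw = WRITE | REQ_SYNC | REQ_NOIDLE;
```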
-
Lars Ellenberg authored
This probably does not have any real life impact, but we should first persist any potentially new UUID and other meta data flags, as well as our new role, before we allow/disallow write access. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
This should reduce latency for such in-DRBD-protocol "pings", and may help reduce spurious disconnect/reconnect cycles due to "PingAck did not arrive in time." Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
The conf_update mutex used to be held while clearing the net_conf->discard_my_data flag inside drbd_set_role. It was moved into drbd_adm_set_role with "drbd: allow parallel promote/demote actions", but then replaced at that location by the newly introduced adm_mutex with "drbd: Fix a potential deadlock in drbdsetup, introduce resource->adm_mutex". And I simply forgot to put it back in at the original location. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
If we re-write all meta data due to a resize, we have open-coded write-out of our meta data super block. Stop the md_sync_timer; it would just trigger scary, but in this case spurious, "timer expired" messages. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
This fixes one recent regression and one long-existing bug. The bug: drbd_try_clear_on_disk_bm() assumed that all "count" bits have to be accounted in the resync extent corresponding to the start sector. Since we allow application requests to cross our "extent" boundaries, this assumption is no longer true, resulting in possible misaccounting, scary messages ("BAD! sector=12345s enr=6 rs_left=-7 rs_failed=0 count=58 cstate=..."), and, potentially, if the last bit to be cleared during resync were to reside in a previously misaccounted resync extent, the resync would never be recognized as finished, but would be "stalled" forever, even though all blocks are in sync again and all bits have been cleared... The regression was introduced by "drbd: get rid of atomic update on disk bitmap works". For an "empty" resync (rs_total == 0), we must not "finish" the resync on the SyncSource before the SyncTarget knows all relevant information (sync uuid); we need to wait for the full round-trip, the SyncTarget will then explicitly notify us. Also for normal, non-empty resyncs (rs_total > 0), the resync-finished condition needs to be tested before the schedule() in wait_for_work, or it is likely to be missed. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
We may implicitly call drbd_send() from inside wait_for_work(), via maybe_send_barrier(). If the "stop" signal was sent just before that, drbd_send() would call flush_signals(), and we would run an unbounded schedule() afterwards. Fix: check for thread_state == RUNNING before we schedule(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
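The essence of the fix, as a simplified sketch (helper and state names follow the surrounding description; the loop context around it is omitted):

```c
/* Sketch: maybe_send_barrier() may end up in drbd_send(), which calls
 * flush_signals(); if the stop signal arrived just before that, we must
 * not go to sleep -- nobody would be left to wake us up. */
if (get_t_state(&connection->worker) == RUNNING)
	schedule();
```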
-
Lars Ellenberg authored
Just trigger the occasional lazy bitmap write-out during resync from the central wait_for_work() helper. Previously, during resync, bitmap pages would be written out separately, synchronously, one at a time, at least 8 times each (every 512 bytes worth of bitmap cleared). Now we trigger "merge friendly" bulk write out of all cleared pages every two seconds during resync, and once the resync is finished. Most pages will be written out only once. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
Previously, once you disabled flushes as a means of enforcing write-ordering, you'd need to detach/re-attach to enable them again. Allow drbdsetup disk-options to re-enable previously disabled write-ordering policy options at runtime. While at it, fix RCU usage in drbd_bump_write_ordering(): max_allowed_wo() uses rcu_dereference(), therefore it must be called within rcu_read_lock()/rcu_read_unlock(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
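The RCU part of the fix follows the standard pattern -- any rcu_dereference() must happen under the read lock. A sketch, not the literal DRBD code:

```c
/* Sketch: max_allowed_wo() dereferences RCU-protected configuration, so the
 * whole call has to sit inside an RCU read-side critical section. */
rcu_read_lock();
wo = max_allowed_wo(bdev, wo);	/* uses rcu_dereference() internally */
rcu_read_unlock();
```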
-
Lars Ellenberg authored
Reduce the number of calls to first_peer_device(). Instead, call first_peer_device() just once to assign a local variable peer_device. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
Instead of dropping and re-acquiring the spinlock around the submit, just remember that we want to submit, and do that only once we have dropped the spinlock for good. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
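The pattern is the usual "decide under the lock, act after unlocking" -- a minimal sketch with made-up helper names:

```c
/* Sketch: do not drop and re-take the spinlock just to submit. */
static void queue_and_maybe_submit(struct drbd_device *device,
				   struct drbd_request *req)
{
	bool submit = false;

	spin_lock_irq(&device->resource->req_lock);
	/* ... queue the request, update activity-log bookkeeping ... */
	if (drbd_al_ready(device, req))		/* hypothetical predicate */
		submit = true;			/* only remember the decision */
	spin_unlock_irq(&device->resource->req_lock);

	if (submit)
		drbd_submit_req(device, req);	/* hypothetical submit helper */
}
```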
-
Philipp Reisner authored
Since the member of drbd_device is called ldev Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Philipp Reisner authored
Some parts of the code assumed that get_ldev_if_state(device, D_ATTACHING) is sufficient to access the ldev member of the device object. That was wrong: ldev may not be there, or might be freed at any time, if the device has a disk state of D_ATTACHING.

bm_rw(): Documented that drbd_bm_read() is only called from drbd_adm_attach. drbd_bm_write() is only called when a reference is held, and it is documented that a caller has to hold a reference before calling drbd_bm_write().

drbd_bm_write_page(): Use get_ldev() instead of get_ldev_if_state(device, D_ATTACHING).

drbd_bmio_set_n_write(): No longer use get_ldev_if_state(device, D_ATTACHING). All callers hold a reference to ldev now.

drbd_bmio_clear_n_write(): All callers were holding a reference to ldev anyway. Remove the misleading get_ldev_if_state(device, D_ATTACHING).

drbd_reconsider_max_bio_size(): Removed the get_ldev_if_state(device, D_ATTACHING). All callers now pass a struct drbd_backing_dev* when they have a proper reference, or a NULL pointer. Before this fix, the receiver could trigger a NULL pointer dereference in drbd_reconsider_max_bio_size().

drbd_bump_write_ordering(): Used get_ldev_if_state(device, D_ATTACHING) with the wrong assumption. Remove it, and allow the caller to pass in a struct drbd_backing_dev* when the caller knows that accessing this bdev is safe.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
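The safe pattern the patch converges on is the usual reference-counted access -- a sketch (macro details may differ from the exact DRBD definitions):

```c
/* Sketch: take a reference that guarantees device->ldev stays valid,
 * instead of relying on get_ldev_if_state(device, D_ATTACHING). */
if (get_ldev(device)) {
	struct drbd_backing_dev *bdev = device->ldev;

	/* ... safe to dereference bdev here ... */

	put_ldev(device);
}
```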
-
Philipp Reisner authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
- 01 Jul, 2014 11 commits
-
Ming Lei authored
First, this patch supports more than one virtual queue for a virtio-blk device. Second, it maps each virtual queue to a blk-mq hardware queue. With this approach, both scalability and performance can be improved. Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
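The core of the mapping is simply to expose one blk-mq hardware queue per virtqueue. A fragment-style sketch (field and ops names follow the virtio-blk driver of that era but are not guaranteed verbatim):

```c
/* Sketch: advertise one blk-mq hardware queue per virtio virtqueue, so
 * each vq gets its own submission context instead of sharing one lock. */
memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
vblk->tag_set.ops = &virtio_mq_ops;		/* ops name assumed */
vblk->tag_set.nr_hw_queues = vblk->num_vqs;	/* was effectively 1 before */
vblk->tag_set.queue_depth = virtblk_queue_depth;
err = blk_mq_alloc_tag_set(&vblk->tag_set);
```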
-
Ming Lei authored
Current virtio-blk spec only supports one virtual queue for transfering data between VM and host, and inside VM all kinds of operations on the virtual queue needs to hold one lock, so cause below problems: - bad scalability - bad throughput This patch requests to introduce feature of VIRTIO_BLK_F_MQ so that more than one virtual queues can be used to virtio-blk device, then above problems can be solved or eased. Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Douglas Gilbert authored
After the SG_IO ioctl was copied into the block layer and later into the bsg driver, subtle differences emerged. One difference is the way injected commands are queued through the block layer (i.e. this is not SCSI device queueing nor SATA NCQ). Summarizing:

- SG_IO on a block layer device: blk_exec*(at_head=false)
- sg device SG_IO: at_head=true
- bsg device SG_IO: at_head=true

Some time ago Boaz Harrosh introduced a sg v4 flag called BSG_FLAG_Q_AT_TAIL to override the bsg driver default. A recent patch titled "sg: add SG_FLAG_Q_AT_TAIL flag" allowed the sg driver default to be overridden. This patch allows a SG_IO ioctl sent to a block layer device to have its default overridden.

ChangeLog:
- introduce SG_FLAG_Q_AT_HEAD flag in sg.h to cause commands that are injected via a block layer device SG_IO ioctl to set at_head=true
- make comments clearer about queueing in sg.h since the header is used both by the sg device and block layer device implementations of the SG_IO ioctl
- introduce BSG_FLAG_Q_AT_HEAD in bsg.h for compatibility (it does nothing) and update comments

Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <axboe@fb.com>
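From user space the new flag is used like any other sg v3 flag -- a hedged sketch (CDB construction and error handling omitted):

```c
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Sketch: ask the block-layer SG_IO path to queue this command at the head. */
int send_at_head(int fd, unsigned char *cdb, int cdb_len,
		 void *buf, int buf_len)
{
	struct sg_io_hdr hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmdp = cdb;
	hdr.cmd_len = cdb_len;
	hdr.dxferp = buf;
	hdr.dxfer_len = buf_len;
	hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	hdr.timeout = 60000;			/* milliseconds */
	hdr.flags = SG_FLAG_Q_AT_HEAD;		/* flag introduced by this patch */

	return ioctl(fd, SG_IO, &hdr);
}
```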
-
Akinobu Mita authored
The SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access the reserved buffer size in bytes as an int. The value needs to be capped at the request queue's max_sectors, but integer overflow is not correctly handled in the calculation when converting max_sectors from sectors to bytes. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
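The overflow-safe way to do the sectors-to-bytes conversion is to clamp in sector units first -- a sketch, not necessarily the literal fix:

```c
/* Sketch: max_sectors is in 512-byte sectors; a plain "<< 9" to get bytes
 * can overflow int.  Clamp in sector units before shifting. */
int max_bytes = min_t(unsigned int, queue_max_sectors(q), INT_MAX >> 9) << 9;
```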
-
Akinobu Mita authored
The BLKSECTGET ioctl loads the request queue's max_sectors into the argument pointer as an unsigned short value. So if max_sectors is greater than USHRT_MAX, its upper 16 bits are simply discarded. In such a case, USHRT_MAX is preferable to the lower 16 bits of max_sectors. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
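The fix amounts to saturating before the narrowing store -- a sketch:

```c
/* Sketch: BLKSECTGET reports through an unsigned short; saturate at
 * USHRT_MAX rather than silently truncating the upper 16 bits. */
unsigned short max_sectors =
	min_t(unsigned int, queue_max_sectors(q), USHRT_MAX);

return put_user(max_sectors, (unsigned short __user *)arg);
```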
-
Fabian Frederick authored
Add function documentation and fix kernel-doc warnings (uniform 'field: description' formatting). Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Fabian Frederick authored
checkpatch fixing:
WARNING: Missing a blank line after declarations
WARNING: space prohibited between function name and open parenthesis '('
ERROR: spaces required around that '<' (ctx:VxV)
Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Fabian Frederick authored
Also add a no-prefix pr_fmt() to guard against any future change of the default format. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Fabian Frederick authored
kcalloc checks the count*sizeof multiplication for overflow. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>
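I.e. the conversion pattern is simply (a generic sketch, not tied to the specific call site):

```c
/* Sketch: kcalloc() checks the n * size multiplication for overflow and
 * zeroes the memory, unlike an open-coded kmalloc(n * size, ...). */
ptr = kcalloc(nents, sizeof(*ptr), GFP_KERNEL);
if (!ptr)
	return -ENOMEM;
```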
-
Gu Zheng authored
Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from Martin introduced the function bip_integrity_vecs() (get the useful vectors) to fix the nr_vecs issue for inline integrity vectors reported by David Milburn. But it seems that bip_integrity_vecs() will return the wrong number if the bio is not based on any bio_set for some reason (bio->bi_pool == NULL), because in that case bip_inline_vecs[0] is allocated directly. So here we add bip_max_vcnt to record the count of vector slots, and clean up the function bip_integrity_vecs(). Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
Tejun Heo authored
Currently, blk-mq uses a percpu_counter to keep track of how many usages are in flight. The percpu_counter is drained while freezing to ensure that no usage is left in-flight after freezing is complete. blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this per-cpu gating mechanism. This type of code has relatively high chance of subtle bugs which are extremely difficult to trigger and it's way too hairy to be open coded in blk-mq. percpu_ref can serve the same purpose after the recent changes. This patch replaces the open-coded per-cpu usage counting and draining mechanism with percpu_ref. blk_mq_queue_enter() performs tryget_live on the ref and exit() performs put. blk_mq_freeze_queue() kills the ref and waits until the reference count reaches zero. blk_mq_unfreeze_queue() revives the ref and wakes up the waiters. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com>
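Sketched with the percpu_ref API (simplified; field names like mq_usage_counter/mq_freeze_wq and helpers like percpu_ref_is_zero()/percpu_ref_reinit() are assumed to match that kernel era, and the real blk-mq plumbing has more moving parts):

```c
/* Sketch: gate queue usage with a percpu_ref instead of an open-coded
 * per-cpu counter plus draining logic. */
static inline int blk_mq_queue_enter(struct request_queue *q)
{
	if (percpu_ref_tryget_live(&q->mq_usage_counter))
		return 0;
	/* ref is killed (queue frozen or dying): caller waits or fails */
	return -EBUSY;
}

static inline void blk_mq_queue_exit(struct request_queue *q)
{
	percpu_ref_put(&q->mq_usage_counter);
}

static void blk_mq_freeze_queue(struct request_queue *q)
{
	percpu_ref_kill(&q->mq_usage_counter);
	wait_event(q->mq_freeze_wq,
		   percpu_ref_is_zero(&q->mq_usage_counter));
}

static void blk_mq_unfreeze_queue(struct request_queue *q)
{
	percpu_ref_reinit(&q->mq_usage_counter);	/* revive the ref */
	wake_up_all(&q->mq_freeze_wq);
}
```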
-