1. 09 Nov, 2012 37 commits
  2. 08 Nov, 2012 3 commits
    • drbd: flush drbd work queue before invalidate/invalidate remote · 970fbde1
      Lars Ellenberg authored
      If you do back-to-back wait-sync/invalidate on a Primary in a tight loop,
      during application IO load, you could trigger a race:
        kernel: block drbd6: FIXME going to queue 'set_n_write from StartingSync'
          but 'write from resync_finished' still pending?
      
      Fix this by changing the order of the drbd_queue_work() and the
      wake_up() in dec_ap_pending(), and by adding an additional
      drbd_flush_workqueue() before requesting the full sync (see the
      sketch after this entry).
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
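      A rough sketch of the ordering involved, assuming dec_ap_pending() wakes
      waiters via wake_up() on mdev->misc_wait; the caller and the work item
      below are made-up stand-ins, not the actual patch, and field names are
      assumed from DRBD conventions:
        /* Illustrative sketch only, not the actual DRBD change. */
        static inline void dec_ap_pending(struct drbd_conf *mdev)
        {
                if (atomic_dec_and_test(&mdev->ap_pending_cnt))
                        wake_up(&mdev->misc_wait);  /* waiter may run right away */
        }

        /* Hypothetical resync-finished path. */
        static void resync_finished_example(struct drbd_conf *mdev, struct drbd_work *w)
        {
                /* Old order: waking first lets the invalidate path queue its
                 * "set_n_write from StartingSync" work before the
                 * "write from resync_finished" work, triggering the FIXME:
                 *
                 *      dec_ap_pending(mdev);
                 *      drbd_queue_work(&mdev->data.work, w);
                 */

                /* New order: queue the pending work first, then wake, so
                 * whatever the waiter queues is ordered behind it. */
                drbd_queue_work(&mdev->data.work, w);
                dec_ap_pending(mdev);
        }

        /* Hypothetical invalidate path. */
        static void invalidate_example(struct drbd_conf *mdev)
        {
                /* Additionally, drain anything already queued before
                 * requesting the full sync. */
                drbd_flush_workqueue(mdev);
                /* ... request the full sync ... */
        }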
    • drbd: call local-io-error handler early · 6f1a6563
      Lars Ellenberg authored
      In case we want to hard-reset from the local-io-error handler,
      we need to call it before notifying the peer or aborting local IO
      (sketched below). Otherwise the peer will advance its data
      generation UUIDs even if this node is Secondary.
      
      This way, a local IO error looks like a "regular" node crash,
      which reduces the number of different failure cases.
      This may be useful in a bigger picture where crashed or otherwise
      "misbehaving" nodes are automatically re-deployed.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
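      A minimal sketch of the resulting ordering, assuming the handler is run
      through drbd_khelper(); the surrounding function and the peer-notification
      step are illustrative placeholders, not DRBD's actual code path:
        /* Sketch only; the surrounding function is hypothetical. */
        static void on_local_disk_failure(struct drbd_conf *mdev)
        {
                /* 1. Run the local-io-error handler first.  If it hard-resets
                 *    the node, the peer has not been told about the IO error
                 *    yet, so this looks like a plain node crash and the peer
                 *    does not advance its data generation UUIDs. */
                drbd_khelper(mdev, "local-io-error");

                /* 2. Only afterwards notify the peer and abort local IO. */
                /* ... send the new disk state to the peer, abort local IO ... */
        }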
    • drbd: do not reset rs_pending_cnt too early · a324896b
      Lars Ellenberg authored
      Fix asserts like
        block drbd0: in got_BlockAck:4634: rs_pending_cnt = -35 < 0 !
      
      We reset the resync lru cache and related information (rs_pending_cnt)
      once we successfully finish a resync or online verify, or if the
      replication connection is lost.
      
      We also need to reset it if a resync or online verify is aborted
      because a lower level disk failed.
      
      In that case the replication link is still established,
      and we may still have packets queued in the network buffers
      which want to touch rs_pending_cnt.
      
      We do not have any synchronization mechanism to know for sure when all
      such pending resync related packets have been drained.
      
      To keep this counter from going negative (and violating the ASSERT that
      it will always be >= 0), just do not reset it when we lose a disk.
      
      It is good enough to make sure it is re-initialized before the next
      resync can start: reset it when we re-attach a disk (see the sketch
      after this entry).
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
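      A sketch of where the reset moves, assuming rs_pending_cnt is an atomic_t
      in struct drbd_conf; both functions below are simplified stand-ins for the
      real call sites, not the actual patch:
        /* Sketch only; function names are hypothetical. */
        static void resync_aborted_by_disk_failure(struct drbd_conf *mdev)
        {
                /* No longer reset the counter here: the replication link is
                 * still established and queued resync packets may yet touch
                 * it, which would drive it negative:
                 *
                 *      atomic_set(&mdev->rs_pending_cnt, 0);
                 */
                /* ... clean up the resync lru cache ... */
        }

        static void disk_reattached(struct drbd_conf *mdev)
        {
                /* Re-initialize on attach instead: no resync can start before
                 * the disk is attached again, so this is early enough. */
                atomic_set(&mdev->rs_pending_cnt, 0);
        }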