1. 07 May, 2012 9 commits
  2. 05 Apr, 2012 1 commit
    • Alex Elder's avatar
      rbd: don't hold spinlock during messenger flush · cd9d9f5d
      Alex Elder authored
      A recent change made changes to the rbd_client_list be protected by
      a spinlock.  Unfortunately in rbd_put_client(), the lock is taken
      before possibly dropping the last reference to an rbd_client, and on
      the last reference that eventually calls flush_workqueue() which can
      sleep.
      
      The problem was flagged by a debug spinlock warning:
          BUG: spinlock wrong CPU on CPU#3, rbd/27814
      
      The solution is to move the spinlock acquisition and release inside
      rbd_client_release(), which is the spot where it's really needed for
      protecting the removal of the rbd_client from the client list.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      cd9d9f5d
  3. 22 Mar, 2012 30 commits
    • Josh Durgin's avatar
      rbd: move snap_rwsem to the device, rename to header_rwsem · c666601a
      Josh Durgin authored
      A new temporary header is allocated each time the header changes, but
      only the changed properties are copied over. We don't need a new
      semaphore for each header update.
      
      This addresses http://tracker.newdream.net/issues/2174Signed-off-by: default avatarJosh Durgin <josh.durgin@dreamhost.com>
      Reviewed-by: default avatarAlex Elder <elder@dreamhost.com>
      c666601a
    • Alex Elder's avatar
      ceph: fix three bugs, two in ceph_vxattrcb_file_layout() · 3489b42a
      Alex Elder authored
      In ceph_vxattrcb_file_layout(), there is a check to determine
      whether a preferred PG should be formatted into the output buffer.
      That check assumes that a preferred PG number of 0 indicates "no
      preference," but that is wrong.  No preference is indicated by a
      negative (specifically, -1) PG number.
      
      In addition, if that condition yields true, the preferred value
      is formatted into a sized buffer, but the size consumed by the
      earlier snprintf() call is not accounted for, opening up the
      possibilty of a buffer overrun.
      
      Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of
      the time displayed did not include leading 0's, which led to
      erroneous (sub-second portion of) time values being shown.
      
      This fixes these three issues:
          http://tracker.newdream.net/issues/2155
          http://tracker.newdream.net/issues/2156
          http://tracker.newdream.net/issues/2157Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      3489b42a
    • Alex Elder's avatar
      libceph: isolate kmap() call in write_partial_msg_pages() · 8d63e318
      Alex Elder authored
      In write_partial_msg_pages(), every case now does an identical call
      to kmap(page).  Instead, just call it once inside the CRC-computing
      block where it's needed.  Move the definition of kaddr inside that
      block, and make it a (char *) to ensure portable pointer arithmetic.
      
      We still don't kunmap() it until after the sendpage() call, in case
      that also ends up needing to use the mapping.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      8d63e318
    • Alex Elder's avatar
      libceph: rename "page_shift" variable to something sensible · 9bd19663
      Alex Elder authored
      In write_partial_msg_pages() there is a local variable used to
      track the starting offset within a bio segment to use.  Its name,
      "page_shift" defies the Linux convention of using that name for
      log-base-2(page size).
      
      Since it's only used in the bio case rename it "bio_offset".  Use it
      along with the page_pos field to compute the memory offset when
      computing CRC's in that function.  This makes the bio case match the
      others more closely.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      9bd19663
    • Alex Elder's avatar
      libceph: get rid of zero_page_address · 0cdf9e60
      Alex Elder authored
      There's not a lot of benefit to zero_page_address, which basically
      holds a mapping of the zero page through the life of the messenger
      module.  Even with our own mapping, the sendpage interface where
      it's used may need to kmap() it again.  It's almost certain to
      be in low memory anyway.
      
      So stop treating the zero page specially in write_partial_msg_pages()
      and just get rid of zero_page_address entirely.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      0cdf9e60
    • Alex Elder's avatar
      libceph: only call kernel_sendpage() via helper · e36b13cc
      Alex Elder authored
      Make ceph_tcp_sendpage() be the only place kernel_sendpage() is
      used, by using this helper in write_partial_msg_pages().
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      e36b13cc
    • Alex Elder's avatar
      libceph: use kernel_sendpage() for sending zeroes · 31739139
      Alex Elder authored
      If a message queued for send gets revoked, zeroes are sent over the
      wire instead of any unsent data.  This is done by constructing a
      message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().
      
      Since we are already working with a page in this case we can use
      the sendpage interface instead.  Create a new ceph_tcp_sendpage()
      helper that sets up flags to match the way ceph_tcp_sendmsg()
      does now.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      31739139
    • Alex Elder's avatar
      libceph: fix inverted crc option logic · 37675b0f
      Alex Elder authored
      CRC's are computed for all messages between ceph entities.  The CRC
      computation for the data portion of message can optionally be
      disabled using the "nocrc" (common) ceph option.  The default is
      for CRC computation for the data portion to be enabled.
      
      Unfortunately, the code that implements this feature interprets the
      feature flag wrong, meaning that by default the CRC's have *not*
      been computed (or checked) for the data portion of messages unless
      the "nocrc" option was supplied.
      
      Fix this, in write_partial_msg_pages() and read_partial_message().
      Also change the flag variable in write_partial_msg_pages() to be
      "no_datacrc" to match the usage elsewhere in the file.
      
      This fixes http://tracker.newdream.net/issues/2064Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Reviewed-by: default avatarSage Weil <sage@newdream.net>
      37675b0f
    • Alex Elder's avatar
      libceph: some simple changes · 84495f49
      Alex Elder authored
      Nothing too big here.
          - define the size of the buffer used for consuming ignored
            incoming data using a symbolic constant
          - simplify the condition determining whether to unmap the page
            in write_partial_msg_pages(): do it for crc but not if the
            page is the zero page
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      84495f49
    • Alex Elder's avatar
      libceph: small refactor in write_partial_kvec() · f42299e6
      Alex Elder authored
      Make a small change in the code that counts down kvecs consumed by
      a ceph_tcp_sendmsg() call.  Same functionality, just blocked out
      a little differently.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      f42299e6
    • Alex Elder's avatar
      libceph: do crc calculations outside loop · fe3ad593
      Alex Elder authored
      Move blocks of code out of loops in read_partial_message_section()
      and read_partial_message().  They were only was getting called at
      the end of the last iteration of the loop anyway.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      fe3ad593
    • Alex Elder's avatar
      libceph: separate CRC calculation from byte swapping · a9a0c51a
      Alex Elder authored
      Calculate CRC in a separate step from rearranging the byte order
      of the result, to improve clarity and readability.
      
      Use offsetof() to determine the number of bytes to include in the
      CRC calculation.
      
      In read_partial_message(), switch which value gets byte-swapped,
      since the just-computed CRC is already likely to be in a register.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      a9a0c51a
    • Alex Elder's avatar
      libceph: use "do" in CRC-related Boolean variables · bca064d2
      Alex Elder authored
      Change the name (and type) of a few CRC-related Boolean local
      variables so they contain the word "do", to distingish their purpose
      from variables used for holding an actual CRC value.
      
      Note that in the process of doing this I identified a fairly serious
      logic error in write_partial_msg_pages():  the value of "do_crc"
      assigned appears to be the opposite of what it should be.  No
      attempt to fix this is made here; this change preserves the
      erroneous behavior.  The problem I found is documented here:
          http://tracker.newdream.net/issues/2064Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      bca064d2
    • Alex Elder's avatar
      ceph: ensure Boolean options support both senses · cffaba15
      Alex Elder authored
      Many ceph-related Boolean options offer the ability to both enable
      and disable a feature.  For all those that don't offer this, add
      a new option so that they do.
      
      Note that ceph_show_options()--which reports mount options currently
      in effect--only reports the option if it is different from the
      default value.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      cffaba15
    • Alex Elder's avatar
      libceph: a few small changes · d3002b97
      Alex Elder authored
      This gathers a number of very minor changes:
          - use %hu when formatting the a socket address's address family
          - null out the ceph_msgr_wq pointer after the queue has been
            destroyed
          - drop a needless cast in ceph_write_space()
          - add a WARN() call in ceph_state_change() in the event an
            unrecognized socket state is encountered
          - rearrange the logic in ceph_con_get() and ceph_con_put() so
            that:
              - the reference counts are only atomically read once
      	- the values displayed via dout() calls are known to
      	  be meaningful at the time they are formatted
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      d3002b97
    • Alex Elder's avatar
      libceph: make ceph_tcp_connect() return int · 41617d0c
      Alex Elder authored
      There is no real need for ceph_tcp_connect() to return the socket
      pointer it creates, since it already assigns it to con->sock, which
      is visible to the caller.  Instead, have it return an error code,
      which tidies things up a bit.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      41617d0c
    • Alex Elder's avatar
      libceph: encapsulate some messenger cleanup code · 6173d1f0
      Alex Elder authored
      Define a helper function to perform various cleanup operations.  Use
      it both in the exit routine and in the init routine in the event of
      an error.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      6173d1f0
    • Alex Elder's avatar
      libceph: make ceph_msgr_wq private · e0f43c94
      Alex Elder authored
      The messenger workqueue has no need to be public.  So give it static
      scope.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      e0f43c94
    • Alex Elder's avatar
      libceph: encapsulate connection kvec operations · 859eb799
      Alex Elder authored
      Encapsulate the operation of adding a new chunk of data to the next
      open slot in a ceph_connection's out_kvec array.  Also add a "reset"
      operation to make subsequent add operations start at the beginning
      of the array again.
      
      Use these routines throughout, avoiding duplicate code and ensuring
      all calls are handled consistently.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      859eb799
    • Alex Elder's avatar
      libceph: move prepare_write_banner() · 963be4d7
      Alex Elder authored
      One of the arguments to prepare_write_connect() indicates whether it
      is being called immediately after a call to prepare_write_banner().
      Move the prepare_write_banner() call inside prepare_write_connect(),
      and reinterpret (and rename) the "after_banner" argument so it
      indicates that prepare_write_connect() should *make* the call
      rather than should know it has already been made.
      
      This was split out from the next patch to highlight this change in
      logic.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      963be4d7
    • Alex Elder's avatar
      rbd: don't drop the rbd_id too early · 32eec68d
      Alex Elder authored
      Currently an rbd device's id is released when it is removed, but it
      is done before the code is run to clean up sysfs-related files (such
      as /sys/bus/rbd/devices/1).
      
      It's possible that an rbd is still in use after the rbd_remove()
      call has been made.  It's essentially the same as an active inode
      that stays around after it has been removed--until its final close
      operation.  This means that the id shows up as free for reuse at a
      time it should not be.
      
      The effect of this was seen by Jens Rehpoehler, who:
          - had a filesystem mounted on an rbd device
          - unmapped that filesystem (without unmounting)
          - found that the mount still worked properly
          - but hit a panic when he attempted to re-map a new rbd device
      
      This re-map attempt found the previously-unmapped id available.
      The subsequent attempt to reuse it was met with a panic while
      attempting to (re-)install the sysfs entry for the new mapped
      device.
      
      Fix this by holding off "putting" the rbd id, until the rbd_device
      release function is called--when the last reference is finally
      dropped.
      
      Note: This fixes: http://tracker.newdream.net/issues/1907Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      32eec68d
    • Alex Elder's avatar
      rbd: small changes · 593a9e7b
      Alex Elder authored
      Here is another set of small code tidy-ups:
          - Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
            names throughout.  Tell the blk_queue system our physical
            block size, in the (unlikely) event we want to use something
            other than the default.
          - Delete the definition of struct rbd_info, which is never used.
          - Move the definition of dev_to_rbd() down in its source file,
            just above where it gets first used, and change its name to
            dev_to_rbd_dev().
          - Replace an open-coded operation in rbd_dev_release() to use
            dev_to_rbd_dev() instead.
          - Calculate the segment size for a given rbd_device just once in
            rbd_init_disk().
          - Use the '%zd' conversion specifier in rbd_snap_size_show(),
            since the value formatted is a size_t.
          - Switch to the '%llu' conversion specifier in rbd_snap_id_show().
            since the value formatted is unsigned.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      593a9e7b
    • Alex Elder's avatar
      rbd: do some refactoring · 00f1f36f
      Alex Elder authored
      A few blocks of code are rearranged a bit here:
          - In rbd_header_from_disk():
      	- Don't bother computing snap_count until we're sure the
      	  on-disk header starts with a good signature.
      	- Move a few independent lines of code so they are *after* a
      	  check for a failed memory allocation.
      	- Get rid of unnecessary local variable "ret".
          - Make a few other changes in rbd_read_header(), similar to the
            above--just moving things around a bit while preserving the
            functionality.
          - In rbd_rq_fn(), just assign rq in the while loop's controlling
            expression rather than duplicating it before and at the end of
            the loop body.  This allows the use of "continue" rather than
            "goto next" in a number of spots.
          - Rearrange the logic in snap_by_name().  End result is the same.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      00f1f36f
    • Alex Elder's avatar
      rbd: fix module sysfs setup/teardown code · fed4c143
      Alex Elder authored
      Once rbd_bus_type is registered, it allows an "add" operation via
      the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
      way establishes a connection between the device and rbd_root_dev.
      But rbd_root_dev is not registered until after the rbd_bus_type
      registration is complete.  This could (in principle anyway) result
      in an invalid state.
      
      Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
      two initializations and never be faced with this scenario.
      
      In addition, unregister the device in the event the bus registration
      fails at module init time.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      fed4c143
    • Alex Elder's avatar
      rbd: don't allocate mon_addrs buffer in rbd_add() · 7ef3214a
      Alex Elder authored
      The mon_addrs buffer in rbd_add is used to hold a copy of the
      monitor IP addresses supplied via /sys/bus/rbd/add.  That is
      passed to rbd_get_client(), which never modifies it (nor do
      any of the functions it gets passed to thereafter)--the mon_addr
      parameter to rbd_get_client() is a pointer to constant data, so it
      can't be modifed.  Furthermore, rbd_get_client() has the length of
      the mon_addrs buffer and that is used to ensure nothing goes beyond
      its end.
      
      Based on all this, there is no reason that a buffer needs to
      be used to hold a copy of the mon_addrs provided via
      /sys/bus/rbd/add.   Instead, the location within that passed-in
      buffer can be provided, along with the length of the "token"
      therein which represents the monitor IP's.
      
      A small change to rbd_add_parse_args() allows the address within the
      buffer to be passed back, and the length is already returned.  This
      now means that, at least from the perspective of this interface,
      there is no such thing as a list of monitor addresses that is too
      long.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      7ef3214a
    • Alex Elder's avatar
      rbd: have rbd_parse_args() report found mon_addrs size · 5214ecc4
      Alex Elder authored
      The argument parsing routine already computes the size of the
      mon_addrs buffer it extracts from the "command."  Pass it to the
      caller so it can use it to provide the length to rbd_get_client().
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      5214ecc4
    • Alex Elder's avatar
      rbd: do a few checks at build time · 81a89793
      Alex Elder authored
      This is a bit gratuitous, but there are a few things that can be
      verified at build time rather than run time, so do that.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      81a89793
    • Alex Elder's avatar
      rbd: don't use sscanf() in rbd_add_parse_args() · e28fff26
      Alex Elder authored
      Make use of a few simple helper routines to parse the arguments
      rather than sscanf().  This will treat both missing and too-long
      arguments as invalid input (rather than silently truncating the
      input in the too-long case).  In time this can also be used by
      rbd_add() to use the passed-in buffer in place, rather than copying
      its contents into new buffers.
      
      It appears to me that the sscanf() previously used would not
      correctly handle a supplied snapshot--the two final "%s" conversion
      specifications were not separated by a space, and I'm not sure
      how sscanf() handles that situation.  It may not be well-defined.
      So that may be a bug this change fixes (but I didn't verify that).
      
      The sizes of the mon_addrs and options buffers are now passed to
      rbd_add_parse_args(), so they can be supplied to copy_token().
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      e28fff26
    • Alex Elder's avatar
      rbd: encapsulate argument parsing for rbd_add() · a725f65e
      Alex Elder authored
      Move the code that parses the arguments provided to rbd_add() (which
      are supplied via /sys/bus/rbd/add) into a separate function.
      
      Also rename the "mon_dev_name" variable in rbd_add() to be
      "mon_addrs".   The variable represents a list of one or more
      comma-separated monitor IP addresses, each with an optional port
      number.  I think "mon_addrs" captures that notion a little better.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      a725f65e
    • Alex Elder's avatar
      rbd: simplify error handling in rbd_add() · 27cc2594
      Alex Elder authored
      If a couple pointers are initialized to NULL then a single
      "out_nomem" label can be used for all of the memory allocation
      failure cases in rbd_add().
      
      Also, get rid of the "irc" local variable there.  There is no
      real need for "rc" to be type ssize_t, and it can be used in
      the spot "irc" was.
      Signed-off-by: default avatarAlex Elder <elder@dreamhost.com>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      27cc2594