Commits · 793333a303c90174c317e3fa12e898bbc02daee4 · Kirill Smelkov / linux

08 Jul, 2019 40 commits

rbd: introduce copyup state machine · 793333a3

Ilya Dryomov authored Jun 13, 2019

Both write and copyup paths will get more complex with object map.
Factor copyup code out into a separate state machine.

While at it, take advantage of obj_req->osd_reqs list and issue empty
and current snapc OSD requests together, one after another.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

rbd: rename rbd_obj_setup_*() to rbd_obj_init_*() · ea9b743c

Ilya Dryomov authored May 31, 2019

These functions don't allocate and set up OSD requests anymore.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

rbd: move OSD request allocation into object request state machines · a086a1b8

Ilya Dryomov authored Jun 12, 2019

Following submission, move initial OSD request allocation into object
request state machines.  Everything that has to do with OSD requests is
now handled inside the state machine; all that __rbd_img_fill_request()
has left to do is initialization.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

rbd: factor out __rbd_osd_setup_discard_ops() · 27bbd911

Ilya Dryomov authored May 29, 2019

With obj_req->xferred removed, obj_req->ex.oe_off and obj_req->ex.oe_len
can be updated if required for alignment. Previously the new offset and
length weren't stored anywhere beyond rbd_obj_setup_discard().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

rbd: factor out rbd_osd_setup_copyup() · b5ae8cbc

Ilya Dryomov authored May 29, 2019

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

rbd: introduce obj_req->osd_reqs list · bcbab1db

Ilya Dryomov authored May 27, 2019

Since the dawn of time it had been assumed that a single object request
spawns a single OSD request. This is already impacting copyup: instead
of sending empty and current snapc copyups together, we wait for empty
snapc OSD request to complete in order to reassign obj_req->osd_req
with current snapc OSD request. Looking further, updating potentially
hundreds of snapshot object maps serially is a non-starter.

Replace obj_req->osd_req pointer with obj_req->osd_reqs list. Use
osd_req->r_private_item for linkage.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

libceph: rename r_unsafe_item to r_private_item · 94e85771

Ilya Dryomov authored Jul 08, 2019

This list item remained from when we had safe and unsafe replies
(commit vs ack).  It has since become a private list item for use by
clients.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

rbd: introduce image request state machine · 0192ce2e

Ilya Dryomov authored May 16, 2019

Make it possible to schedule image requests on a workqueue.  This fixes
parent chain recursion added in the previous commit and lays the ground
for exclusive lock wait/wake improvements.

The "wait for pending subrequests and report first nonzero result" code
is generalized to be used by object request state machine.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
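
The "wait for pending subrequests and report first nonzero result" idea can be sketched as a small counter that remembers the first error. This is an illustration only: the struct and function names below are assumptions, not taken from the commit text, and the sketch is single-threaded, with no locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative tracker for a batch of subrequests (names are assumed). */
struct pending_result {
	int result;		/* first nonzero result seen so far */
	int num_pending;	/* subrequests still outstanding */
};

/*
 * Record one subrequest completion. Returns true once the last
 * subrequest has completed, and then stores the first nonzero
 * result (or 0 if there was none) in *result.
 */
static bool pending_result_dec(struct pending_result *pending, int *result)
{
	assert(pending->num_pending > 0);

	if (*result && !pending->result)
		pending->result = *result;
	if (--pending->num_pending)
		return false;

	*result = pending->result;
	return true;
}
```

A caller would set num_pending to the number of subrequests issued and call the helper from each completion; only the completion that returns true moves the state machine on.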

rbd: move OSD request submission into object request state machines · 85b5e6d1

Ilya Dryomov authored May 14, 2019

Start eliminating asymmetry where the initial OSD request is allocated
and submitted from outside the state machine, making error handling and
restarts harder than they could be.  This commit deals with submission,
a commit that deals with allocation will follow.

Note that this commit adds parent chain recursion on the submission
side:

  rbd_img_request_submit
    rbd_obj_handle_request
      __rbd_obj_handle_request
        rbd_obj_handle_read
          rbd_obj_handle_write_guard
            rbd_obj_read_from_parent
              rbd_img_request_submit

This will be fixed in the next commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD} · 0ad5d953

Ilya Dryomov authored May 14, 2019

In preparation for moving OSD request allocation and submission into
object request state machines, get rid of RBD_OBJ_WRITE_{FLAT,GUARD}.
We would need to start in a new state, whether the request is guarded
or not.  Unify them into RBD_OBJ_WRITE_OBJECT and pass guard info
through obj_req->flags.

While at it, make our ENOENT handling a little more precise: only hide
ENOENT when it is actually expected, that is on delete.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
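
The more precise ENOENT handling could look roughly like the sketch below. The flag name and the helper are hypothetical and only illustrate "hide ENOENT when it is expected, that is on delete"; the commit's actual code is not shown on this page.

```c
#include <assert.h>
#include <errno.h>

#define OBJ_FLAG_DELETION	(1U << 0)	/* hypothetical flag name */

/*
 * Filter the result of a write-type request: a missing object is
 * only an acceptable outcome when the request was deleting it.
 */
static int filter_write_result(int result, unsigned int flags)
{
	assert(result <= 0);	/* 0 on success, negative errno on failure */

	if (result == -ENOENT && (flags & OBJ_FLAG_DELETION))
		return 0;
	return result;
}
```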

rbd: replace obj_req->tried_parent with obj_req->read_state · a9b67e69

Ilya Dryomov authored May 08, 2019

Make rbd_obj_handle_read() look like a state machine and get rid of
the necessity to patch result in rbd_obj_handle_request(), completing
the removal of obj_req->xferred and img_req->xferred.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

rbd: get rid of obj_req->xferred, obj_req->result and img_req->xferred · 54ab3b24

Ilya Dryomov authored May 11, 2019

obj_req->xferred and img_req->xferred don't bring any value. The
former is used for short reads and has to be set to obj_req->ex.oe_len
after that and elsewhere. The latter is just an aggregate.

Use result for short reads (>=0 - number of bytes read, <0 - error) and
pass it around explicitly. No need to store it in obj_req.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
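
The result convention described above (>=0 is the number of bytes read, <0 is an error) can be illustrated with a small userspace sketch. The helper name is hypothetical, and zero-filling the unread tail of a short read is an assumption made for the example, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Interpret a read result passed around explicitly:
 *   result <  0: negative errno, returned unchanged
 *   result >= 0: number of bytes actually read
 * A short read has its tail zero-filled (assumed behavior).
 */
static int complete_read(char *buf, size_t len, int result)
{
	assert(buf != NULL);

	if (result < 0)
		return result;
	if ((size_t)result < len)
		memset(buf + result, 0, len - (size_t)result);
	return 0;
}
```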

ceph: don't NULL terminate virtual xattrs · 26350535

Jeff Layton authored Jun 24, 2019

The convention with xattrs is to not store the termination with string
data, given that it returns the length. This is how setfattr/getfattr
operate.

Most of ceph's virtual xattr routines use snprintf to plop the string
directly into the destination buffer, but snprintf always NULL
terminates the string. This means that if we send the kernel a buffer
that is the exact length needed to hold the string, it'll end up
truncated.

Add a ceph_fmt_xattr helper function to format the string into an
on-stack buffer that should always be large enough to hold the whole
thing and then memcpy the result into the destination buffer. If it does
turn out that the formatted string won't fit in the on-stack buffer,
then return -E2BIG and do a WARN_ONCE().

Change over most of the virtual xattr routines to use the new helper. A
couple of the xattrs are sourced from strings however, and it's
difficult to know how long they'll be. Just have those memcpy the result
in place after verifying the length.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
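
A userspace sketch of the approach described above: format into an on-stack buffer, then memcpy only the string bytes so that no terminator lands in the caller's buffer. The function name, the buffer size, and the exact return conventions are assumptions for illustration; the kernel helper also issues WARN_ONCE(), which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define FMT_BUF_SIZE 64	/* assumed on-stack buffer size */

/*
 * Format into a local buffer, then copy the bytes without the
 * trailing NUL. Returns the value length, or -E2BIG if the
 * formatted string does not fit in the local buffer. With
 * size == 0 nothing is copied and only the length is reported.
 */
static int fmt_xattr(char *dst, size_t size, const char *fmt, ...)
{
	char buf[FMT_BUF_SIZE];
	va_list ap;
	int len;

	assert(size == 0 || dst != NULL);

	va_start(ap, fmt);
	len = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	if (len < 0)
		return -EINVAL;
	if ((size_t)len >= sizeof(buf))
		return -E2BIG;
	if (size && (size_t)len <= size)
		memcpy(dst, buf, (size_t)len);
	return len;
}
```

With a 4-byte destination and the value 1234, snprintf() straight into the destination would store "123" plus a terminator, whereas the helper stores all four digits.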

ceph: return -ERANGE if virtual xattr value didn't fit in buffer · 3b421018

Jeff Layton authored Jun 13, 2019

The getxattr manpage states that we should return ERANGE if the
destination buffer size is too small to hold the value.
ceph_vxattrcb_layout does this internally, but we should be doing
this for all vxattrs.

Fix the only caller of getxattr_cb to check the returned size
against the buffer length and return -ERANGE if it doesn't fit.
Drop the same check in ceph_vxattrcb_layout and just rely on the
caller to handle it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
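
The caller-side check might be sketched as follows. The callback type, the demo callback, and the wrapper name are all hypothetical; the sketch only shows the idea of comparing the reported length against the buffer size in one place and returning -ERANGE there.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical callback: reports the value length, copies only if it fits. */
typedef ssize_t (*getxattr_cb_t)(char *val, size_t size);

static ssize_t demo_vxattr_cb(char *val, size_t size)
{
	static const char value[] = "hello";
	size_t len = sizeof(value) - 1;

	if (size && len <= size)
		memcpy(val, value, len);
	return (ssize_t)len;
}

/* Single place where "value does not fit" becomes -ERANGE. */
static ssize_t get_vxattr(getxattr_cb_t cb, char *val, size_t size)
{
	ssize_t len;

	assert(cb != NULL);

	len = cb(val, size);
	if (len < 0)
		return len;
	if (size && (size_t)len > size)
		return -ERANGE;
	return len;
}
```

A size of 0 is treated as a length query, matching the getxattr convention of returning the required size when no buffer is supplied.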

ceph: make getxattr_cb return ssize_t · f1d1b51d

Jeff Layton authored Jun 24, 2019

The getxattr_cb functions return size_t, which is unsigned. That value
is then cast to int and then to ssize_t before being returned. While all
of this works, it relies on implicit casting rules for signed/unsigned
conversions.

Change getxattr_cb to return ssize_t to better conform with what the
caller actually wants. Also, remove some suspicious casts.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAP · 49ada6e8

Yan, Zheng authored Jun 20, 2019

The client uses this flag to tell the MDS whether there are more cap
snaps that need to be flushed. It's mainly for the case where the client
needs to re-send cap/snap flushes after an MDS failover, but
CEPH_CAP_ANY_FILE_WR caps on the corresponding inodes were all released
before the failover.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: kick flushing and flush snaps before sending normal cap message · d6cee9db

Yan, Zheng authored Jun 20, 2019

Otherwise the client may send cap flush messages in the wrong order.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: clear CEPH_I_KICK_FLUSH flag inside __kick_flushing_caps() · 054f8d41

Yan, Zheng authored Jun 20, 2019

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: increment change_attribute on local changes · 5c308356

Jeff Layton authored Jun 06, 2019

We don't set SB_I_VERSION on ceph since we need to manage it ourselves,
so we must increment it whenever we update the file times.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: handle change_attr in cap messages · 176c77c9

Jeff Layton authored Jun 06, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add change_attr field to ceph_inode_info · a35ead31

Jeff Layton authored Jun 06, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

iversion: add a routine to update a raw value with a larger one · 441d3676

Jeff Layton authored Jun 05, 2019

Under ceph, clients can be independently updating iversion themselves,
while working under comprehensive sets of caps on an inode. In that
situation we always want to prefer the largest value of a change
attribute. Add a new function that will update a raw value with a larger
one, but otherwise leave it alone.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: allow querying of STATX_BTIME in ceph_getattr · 58981784

Jeff Layton authored Jun 05, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: turn on CEPH_FEATURE_MSG_ADDR2 · 6adaaafd

Jeff Layton authored May 31, 2019

Now that the client can handle either address formatting, advertise to
the peer that we can support it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: handle btime in cap messages · ec62b894

Jeff Layton authored May 29, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add btime field to ceph_inode_info · 245ce991

Jeff Layton authored May 29, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: rename ceph_encode_addr to ceph_encode_banner_addr · 2c66de56

Jeff Layton authored Jun 17, 2019

...ditto for the decode function. We only use these functions to fix
up banner addresses now, so let's name them more appropriately.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE · d3c3c0a8

Jeff Layton authored Jun 17, 2019

Going forward, we'll have different address types so let's use
the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE.

Also, make ceph_pr_addr print the address type value as well.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: fix decode_locker to use ceph_decode_entity_addr · 2f9800c8

Jeff Layton authored Jun 04, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: have MDS map decoding use entity_addr_t decoder · f3848af1

Jeff Layton authored Jun 04, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: correctly decode ADDR2 addresses in incremental OSD maps · 8cb5f2b4

Jeff Layton authored Jun 04, 2019

Given the new format, we have to decode the addresses twice. Once to
skip past the new_up_client field, and a second time to collect the
addresses.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: fix watch_item_t decoding to use ceph_decode_entity_addr · 51fc7ab4

Jeff Layton authored Jun 04, 2019

While we're in there, let's also fix up the decoder to do proper
bounds checking.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: switch osdmap decoding to use ceph_decode_entity_addr · dcbc919a

Jeff Layton authored Jun 03, 2019

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: ADDR2 support for monmap · 0bfb0f28

Jeff Layton authored May 31, 2019

Switch the MonMap decoder to use the new decoding routine for
entity_addr_t's.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: add ceph_decode_entity_addr · 6c37f0e6

Jeff Layton authored Jun 03, 2019

Add a function for decoding an entity_addr_t. Once
CEPH_FEATURE_MSG_ADDR2 is enabled, the server daemons will start
encoding entity_addr_t differently.

Add a new helper function that can handle either format.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: fix sa_family just after reading address · bc07532c

Jeff Layton authored Jun 04, 2019

It doesn't make sense to leave it undecoded until later.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: remove request from waiting list before unregister · 428138c9

Yan, Zheng authored Jun 14, 2019

Link: https://tracker.ceph.com/issues/40339
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: don't blindly unregister session that is in opening state · 6f0f597b

Yan, Zheng authored Jun 10, 2019

handle_cap_export() may add placeholder caps to a session that is in the
opening state. These caps' session pointers become wild after the
session gets unregistered.

The fix is not to unregister a session in the opening state during MDS
failovers; just let the client reconnect later, once the MDS has
recovered.

Link: https://tracker.ceph.com/issues/40190
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: fix infinite loop in get_quota_realm() · 2ef5df1a

Yan, Zheng authored May 31, 2019

get_quota_realm() enters an infinite loop if the quota inode has no caps.
This can happen after the client gets evicted.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add selinux support · ac6713cc

Yan, Zheng authored May 26, 2019

When creating a new file or directory, use
security_dentry_init_security() to prepare the selinux context for the
new inode, then send the openc/mkdir request to the MDS, together with
the selinux xattr.

security_dentry_init_security() only supports a single security module,
and only selinux has a dentry_init_security hook. So only selinux is
supported for now. We can add support for other security modules once
the kernel has a generic version of dentry_init_security().
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
