Commits · 125d725c923527a85876c031028c7f55c28b74b3 · Kirill Smelkov / linux

28 Jan, 2014 2 commits

ceph: cast PAGE_SIZE to size_t in ceph_sync_write() · 125d725c

Ilya Dryomov authored Jan 28, 2014

Use min_t(size_t, ...) instead of plain min(), which does strict type
checking, to avoid a compile warning on i386.

Cc: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

125d725c
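
The type mismatch behind this warning can be reproduced outside the kernel. What follows is a minimal userspace sketch, assuming a 4096-byte page. `MIN_T`, `SKETCH_PAGE_SIZE` and `chunk_len()` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's min_t(): cast both operands to
 * one type before comparing. Unlike the real macro, this one evaluates
 * its arguments more than once, so pass side-effect-free expressions.
 */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Assumed page size; typed unsigned long, like PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical helper: number of bytes to handle in one step. */
static size_t chunk_len(size_t remaining)
{
	/* size_t is unsigned int on i386, so the operand types differ there. */
	return MIN_T(size_t, remaining, SKETCH_PAGE_SIZE);
}

/* Usage: short tails pass through, long requests are capped at a page. */
static void chunk_len_demo(void)
{
	assert(chunk_len(100) == 100);
	assert(chunk_len(10000) == SKETCH_PAGE_SIZE);
}
```

Casting both sides to size_t keeps the comparison in one type on 32-bit and 64-bit builds alike.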

ceph: fix dout() compile warnings in ceph_filemap_fault() · 37b52fe6

Ilya Dryomov authored Jan 28, 2014

PAGE_CACHE_SIZE is unsigned long on all architectures, however size_t
is either unsigned int or unsigned long.  Rather than change format
strings, cast PAGE_CACHE_SIZE to size_t to be in line with dout()s in
ceph_page_mkwrite().

Cc: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

37b52fe6
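
The same kind of mismatch shows up in printf-style format strings: a size_t conversion needs a size_t argument. This is a minimal userspace sketch, assuming a 4096-byte page and a `%zu` conversion. `SKETCH_PAGE_CACHE_SIZE` and `format_range()` are hypothetical names, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed page cache size; unsigned long, as the message above notes. */
#define SKETCH_PAGE_CACHE_SIZE 4096UL

/* Hypothetical debug formatter: prints an offset and a page-sized length. */
static int format_range(char *buf, size_t buflen, size_t off)
{
	/* The cast makes the argument match %zu on every architecture. */
	return snprintf(buf, buflen, "off %zu len %zu", off,
			(size_t)SKETCH_PAGE_CACHE_SIZE);
}

/* Usage: "off 8 len 4096" is 14 characters long. */
static void format_range_demo(void)
{
	char buf[32];

	assert(format_range(buf, sizeof(buf), 8) == 14);
}
```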

27 Jan, 2014 11 commits

libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature · 80e163a5

Ilya Dryomov authored Jan 27, 2014

Announce our (limited, see previous commit) support for the CACHEPOOL
feature.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

80e163a5

libceph: follow redirect replies from osds · 205ee118

Ilya Dryomov authored Jan 27, 2014

Follow redirect replies from osds, for details see ceph.git commit
fbbe3ad1220799b7bb00ea30fce581c5eadaf034.

v1 (current) version of redirect reply consists of oloc and oid, which
expands to pool, key, nspace, hash and oid.  However, server-side code
that would populate anything other than pool doesn't exist yet, and
hence this commit adds support for pool redirects only.  To make sure
that future server-side updates don't break us, we decode all fields
and, if any of key, nspace, hash or oid have a non-default value, error
out with a "corrupt osd_op_reply ..." message.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

205ee118
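
The "decode everything, accept pool only" rule described above can be sketched as follows. This is a userspace illustration under stated assumptions: `struct sketch_redirect` and `sketch_follow_redirect()` are hypothetical, and the default values (empty strings, hash of -1) are assumed rather than taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of a v1 redirect reply. */
struct sketch_redirect {
	int64_t pool;
	size_t key_len;     /* 0 when key is unset */
	size_t nspace_len;  /* 0 when nspace is unset */
	int64_t hash;       /* assumed default: -1 */
	size_t oid_len;     /* 0 when oid is unset */
};

/*
 * Accept pool-only redirects. Returns 0 and stores the new pool on
 * success, -EINVAL if any other field carries a non-default value.
 */
static int sketch_follow_redirect(const struct sketch_redirect *r,
				  int64_t *pool)
{
	if (r->key_len || r->nspace_len || r->oid_len || r->hash != -1)
		return -EINVAL;
	*pool = r->pool;
	return 0;
}

/* Usage: a pool-only redirect is followed, one with nspace is rejected. */
static void sketch_follow_redirect_demo(void)
{
	struct sketch_redirect ok = { .pool = 7, .hash = -1 };
	struct sketch_redirect bad = { .pool = 7, .nspace_len = 3, .hash = -1 };
	int64_t pool = 2;

	assert(sketch_follow_redirect(&ok, &pool) == 0 && pool == 7);
	pool = 2;
	assert(sketch_follow_redirect(&bad, &pool) == -EINVAL && pool == 2);
}
```

Rejecting unknown values outright, instead of silently ignoring them, means a future server that starts populating those fields produces a visible error rather than a misdirected request.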

libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} · 3c972c95

Ilya Dryomov authored Jan 27, 2014

Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

3c972c95

libceph: follow {read,write}_tier fields on osd request submission · 17a13e40

Ilya Dryomov authored Jan 27, 2014

Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and
write_tier for write and read+write ops (aka basic tiering support).
{read,write}_tier are part of pg_pool_t since v9.  This commit bumps
our pg_pool_t decode compat version from v7 to v9; all new fields
except for {read,write}_tier are ignored.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

17a13e40
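
The pool override described above can be sketched as a small selection function. This is an illustration only: the flag values, `struct sketch_pool_info` and `sketch_target_pool()` are hypothetical, and it assumes a negative tier value means "no tier configured".

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FLAG_READ  0x1u
#define SKETCH_FLAG_WRITE 0x2u

/* Hypothetical subset of the per-pool info decoded from the osdmap. */
struct sketch_pool_info {
	int64_t id;
	int64_t read_tier;   /* assumed: negative means "no tier" */
	int64_t write_tier;  /* assumed: negative means "no tier" */
};

/* Pick the pool a request should be sent to, given its op flags. */
static int64_t sketch_target_pool(const struct sketch_pool_info *pi,
				  unsigned int flags)
{
	int64_t pool = pi->id;

	if ((flags & SKETCH_FLAG_READ) && pi->read_tier >= 0)
		pool = pi->read_tier;
	/* Checked last so that read+write ops follow write_tier. */
	if ((flags & SKETCH_FLAG_WRITE) && pi->write_tier >= 0)
		pool = pi->write_tier;
	return pool;
}

/* Usage: reads go to the read tier, writes and read+write to the write tier. */
static void sketch_target_pool_demo(void)
{
	struct sketch_pool_info tiered = { .id = 1, .read_tier = 5, .write_tier = 6 };
	struct sketch_pool_info plain = { .id = 1, .read_tier = -1, .write_tier = -1 };

	assert(sketch_target_pool(&tiered, SKETCH_FLAG_READ) == 5);
	assert(sketch_target_pool(&tiered, SKETCH_FLAG_WRITE) == 6);
	assert(sketch_target_pool(&tiered, SKETCH_FLAG_READ | SKETCH_FLAG_WRITE) == 6);
	assert(sketch_target_pool(&plain, SKETCH_FLAG_READ) == 1);
}
```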

libceph: add ceph_pg_pool_by_id() · ce7f6a27

Ilya Dryomov authored Jan 27, 2014

"Lookup pool info by ID" function is hidden in osdmap.c.  Expose it to
the rest of libceph.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

ce7f6a27

libceph: CEPH_OSD_FLAG_* enum update · 1b3f2ab5

Ilya Dryomov authored Jan 27, 2014

Update CEPH_OSD_FLAG_* enum.  (We need CEPH_OSD_FLAG_IGNORE_OVERLAY to
support tiering).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

1b3f2ab5

libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg() · 7c13cb64

Ilya Dryomov authored Jan 27, 2014

Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

7c13cb64

libceph: introduce and start using oid abstraction · 4295f221

Ilya Dryomov authored Jan 27, 2014

In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

4295f221
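
The idea of an object id that carries an explicit length, so that copies never depend on a NUL terminator, can be sketched in userspace as below. The `sketch_` names and the 100-byte limit are assumptions made for illustration and are not the helpers this commit adds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_OID_NAME_LEN 100  /* assumed limit */

/* Hypothetical object id: raw name bytes plus an explicit length. */
struct sketch_object_id {
	char name[SKETCH_MAX_OID_NAME_LEN];  /* not NUL-terminated */
	size_t name_len;
};

/* Store a name of exactly len bytes; returns -1 if it does not fit. */
static int sketch_oid_set_name(struct sketch_object_id *oid,
			       const void *name, size_t len)
{
	if (len > sizeof(oid->name))
		return -1;
	memcpy(oid->name, name, len);
	oid->name_len = len;
	return 0;
}

/* Copy by length, so embedded NUL bytes survive. */
static void sketch_oid_copy(struct sketch_object_id *dest,
			    const struct sketch_object_id *src)
{
	memcpy(dest->name, src->name, src->name_len);
	dest->name_len = src->name_len;
}

/* Usage: a base name with an embedded NUL is copied to a target intact. */
static void sketch_oid_demo(void)
{
	struct sketch_object_id base, target;

	assert(sketch_oid_set_name(&base, "a\0b", 3) == 0);
	sketch_oid_copy(&target, &base);
	assert(target.name_len == 3);
	assert(memcmp(target.name, "a\0b", 3) == 0);
}
```

A strcpy()-style copy would have stopped after the first byte of that name, which is the failure mode the explicit length avoids.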

libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN · 2d0ebc5d

Ilya Dryomov authored Jan 27, 2014

In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

2d0ebc5d

libceph: move ceph_file_layout helpers to ceph_fs.h · e8221464

Ilya Dryomov authored Jan 27, 2014

Move ceph_file_layout helper macros and inline functions to ceph_fs.h.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

e8221464

libceph: start using oloc abstraction · 22116525

Ilya Dryomov authored Jan 27, 2014

Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for encoding), start using ceph_object_locator (oloc)
abstraction.  Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool.  This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.

This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

22116525

26 Jan, 2014 2 commits

libceph: dout() is missing a newline · 0b4af2e8

Ilya Dryomov authored Jan 16, 2014

Add a missing newline to a dout() in __reset_osd().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

0b4af2e8

libceph: add ceph_kv{malloc,free}() and switch to them · eeb0bed5

Ilya Dryomov authored Jan 09, 2014

Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.

ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
behaviour:

- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
  and using vmalloc() just as a fallback

- for messages (ceph_msg_new()), from going to vmalloc() for anything
  bigger than a page

- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
  memory

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

eeb0bed5
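
The size cutoff described above can be sketched as a pure decision function. This models only the "8 pages or fewer" rule, not the allocation itself or any GFP flags. The 4096-byte page size and all `sketch_` names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096UL                  /* assumed page size */
#define SKETCH_KMALLOC_MAX (8 * SKETCH_PAGE_SIZE)  /* "a maximum of 8 pages" */

enum sketch_allocator { SKETCH_USE_KMALLOC, SKETCH_USE_VMALLOC };

/* Decide which allocator a request of the given size would go to. */
static enum sketch_allocator sketch_pick_allocator(size_t size)
{
	return size <= SKETCH_KMALLOC_MAX ? SKETCH_USE_KMALLOC
					  : SKETCH_USE_VMALLOC;
}

/* Usage: exactly 8 pages still counts as small; one byte more does not. */
static void sketch_pick_allocator_demo(void)
{
	assert(sketch_pick_allocator(512) == SKETCH_USE_KMALLOC);
	assert(sketch_pick_allocator(SKETCH_KMALLOC_MAX) == SKETCH_USE_KMALLOC);
	assert(sketch_pick_allocator(SKETCH_KMALLOC_MAX + 1) == SKETCH_USE_VMALLOC);
}
```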

21 Jan, 2014 11 commits

libceph: support CEPH_FEATURE_EXPORT_PEER · 80213a84

Yan, Zheng authored Jan 21, 2014

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

80213a84

ceph: add imported caps when handling cap export message · 11df2dfb

Yan, Zheng authored Nov 24, 2013

Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.

This allows us to handle the situation where the importer MDS crashes
and the cap import message is missing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

11df2dfb

ceph: add open export target session helper · 5d72d13c

Yan, Zheng authored Nov 24, 2013

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

5d72d13c

ceph: remove exported caps when handling cap import message · 4ee6a914

Yan, Zheng authored Nov 24, 2013

Version 3 cap import message includes the ID of the exported
caps. It allows us to remove the exported caps if we still haven't
received the corresponding cap export message.

We remove the exported caps because they are stale; keeping them
can compromise consistency.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

4ee6a914

ceph: handle session flush message · 186e4f7a

Yan, Zheng authored Nov 22, 2013

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

186e4f7a

ceph: check inode caps in ceph_d_revalidate · 9215aeea

Yan, Zheng authored Nov 30, 2013

Some inodes in a readdir reply may have no caps. A getattr MDS request
for these inodes can return -ESTALE. The fix is to treat a dentry that
links to an inode with no caps as invalid. An invalid dentry causes a
lookup request to be sent to the MDS, and the MDS will send caps back.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

9215aeea

ceph: handle -ESTALE reply · ca18bede

Yan, Zheng authored Nov 22, 2013

Send requests that operate on a path to the directory's auth MDS if
mode == USE_AUTH_MDS. Always retry using the auth MDS if we get an
-ESTALE reply from a non-auth MDS. Also clean up the code that
handles auth MDS changes.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

ca18bede

ceph: fix trim caps · 979abfdd

Yan, Zheng authored Nov 22, 2013

- don't trim auth cap if there are flushing caps
- don't trim auth cap if any 'write' cap is wanted
- allow trimming non-auth cap even if the inode is dirty

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

979abfdd

ceph: fix cache revoke race · 9563f88c

Yan, Zheng authored Nov 22, 2013

Handle the following sequence of events:

- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
  increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
  so it does nothing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

9563f88c

ceph: use ceph_seq_cmp() to compare migrate_seq · d1b87809

Yan, Zheng authored Nov 13, 2013

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

d1b87809

ceph: handle cap export race in try_flush_caps() · 4fe59789

Yan, Zheng authored Oct 31, 2013

The auth cap may change after i_ceph_lock is released.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

4fe59789

17 Jan, 2014 1 commit

ceph: trivial comment fix · fc12c80a

J. Bruce Fields authored Jan 16, 2014

"disconnected" is too easily confused with "DCACHE_DISCONNECTED".  I
think "unhashed" is the more precise term here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Sage Weil <sage@inktank.com>

fc12c80a

14 Jan, 2014 3 commits

libceph: fix preallocation check in get_reply() · f2be82b0

Ilya Dryomov authored Jan 09, 2014

The check that makes sure that we have enough memory allocated to read
in the entire header of the message in question is currently busted.
It compares front_len of the incoming message with iov_len field of
ceph_msg::front structure, which is used primarily to indicate the
amount of data already read in, and not the size of the allocated
buffer.  Under certain conditions (e.g. a short read from a socket
followed by that socket's shutdown and owning ceph_connection reset)
this results in a warning similar to

[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)

and, through another bug, leads to forever hung tasks and forced
reboots.  Fix this by comparing front_len with front_alloc_len field of
struct ceph_msg, which stores the actual size of the buffer.

Fixes: http://tracker.ceph.com/issues/5425
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

f2be82b0
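
The difference between the old and the new comparison can be shown with the numbers from the warning above. This is a minimal sketch: `struct sketch_msg` and both helpers are hypothetical, and the 512-byte allocation is a made-up value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of a message to the two lengths that matter. */
struct sketch_msg {
	size_t front_alloc_len;  /* size of the allocated front buffer */
	size_t front_iov_len;    /* bytes of front data read in so far */
};

/* Old check: compares against read progress, not buffer capacity. */
static int sketch_fits_old(const struct sketch_msg *m, size_t front_len)
{
	return front_len <= m->front_iov_len;
}

/* New check: compares against the actual size of the buffer. */
static int sketch_fits_new(const struct sketch_msg *m, size_t front_len)
{
	return front_len <= m->front_alloc_len;
}

/* Usage: 198 and 122 come from the warning; 512 is an assumed allocation. */
static void sketch_fits_demo(void)
{
	struct sketch_msg m = { .front_alloc_len = 512, .front_iov_len = 122 };

	assert(!sketch_fits_old(&m, 198));  /* spurious "does not fit" */
	assert(sketch_fits_new(&m, 198));   /* buffer is in fact large enough */
}
```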

libceph: rename front to front_len in get_reply() · 3f0a4ac5

Ilya Dryomov authored Jan 09, 2014

Rename front local variable to front_len in get_reply() to make its
purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

3f0a4ac5

libceph: rename ceph_msg::front_max to front_alloc_len · 3cea4c30

Ilya Dryomov authored Jan 09, 2014

Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

3cea4c30

31 Dec, 2013 10 commits

libceph: use CEPH_MON_PORT when the specified port is 0 · f48db1e9

Ilya Dryomov authored Dec 30, 2013

Similar to userspace, don't bail with "parse_ips bad ip ..." if the
specified port is port 0, instead use port CEPH_MON_PORT (6789, the
default monitor port).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

f48db1e9
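
The port fallback can be sketched as a one-line substitution. `SKETCH_MON_PORT` and `sketch_effective_port()` are hypothetical names, and the range check is an assumption added for completeness and is not described in the commit.

```c
#include <assert.h>

#define SKETCH_MON_PORT 6789  /* default monitor port, per the message above */

/* Returns the port to use, or -1 if the parsed value is out of range. */
static int sketch_effective_port(long parsed)
{
	if (parsed < 0 || parsed > 65535)
		return -1;  /* assumed handling of invalid input */
	return parsed == 0 ? SKETCH_MON_PORT : (int)parsed;
}

/* Usage: port 0 falls back to the default, explicit ports are kept. */
static void sketch_effective_port_demo(void)
{
	assert(sketch_effective_port(0) == SKETCH_MON_PORT);
	assert(sketch_effective_port(3300) == 3300);
	assert(sketch_effective_port(70000) == -1);
}
```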

crush: support new indep mode and SET_* steps (crush v2) by default · cdff4991

Ilya Dryomov authored Dec 24, 2013

Add CRUSH_V2 feature (new indep mode and SET_* steps) to a set of
features supported by default.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

cdff4991

crush: fix crush_choose_firstn comment · 0e32d712

Ilya Dryomov authored Dec 24, 2013

Reflects ceph.git commit 8b38f10bc2ee3643a33ea5f9545ad5c00e4ac5b4.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

0e32d712

crush: attempts -> tries · 2d8be0bc

Ilya Dryomov authored Dec 24, 2013

Reflects ceph.git commit ea3a0bb8b773360d73b8b77fa32115ef091c9857.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

2d8be0bc

crush: add set_choose_local_[fallback_]tries steps · f046bf92

Ilya Dryomov authored Dec 24, 2013

This allows all of the tunables to be overridden by a specific rule.

Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292,
                          0497db49e5973b50df26251ed0e3f4ac7578e66e.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

f046bf92

crush: generalize descend_once · d390bb2a

Ilya Dryomov authored Dec 24, 2013

The legacy behavior is to make the normal number of tries for the
recursive chooseleaf call.  The descend_once tunable changed this to
making a single try and bailing if we get a reject (note that it is
impossible to collide in the recursive case).

The new set_chooseleaf_tries lets you select the number of recursive
chooseleaf attempts for indep mode, or default to 1.  Use the same
behavior for firstn, except default to total_tries when the legacy
tunables are set (for compatibility).  This makes the rule step
override the (new) default of 1 recursive attempt, keeping behavior
consistent with indep mode.

Reflects ceph.git commit 685c6950ef3df325ef04ce7c986e36ca2514c5f1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

d390bb2a
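
The defaulting rules described above can be sketched as two small functions. This is an illustration under stated assumptions: the `sketch_` names are hypothetical, a rule value of 0 is taken to mean "no set_chooseleaf_tries step in the rule", and a zero descend_once flag stands for the legacy tunables.

```c
#include <assert.h>

/* Hypothetical subset of the map tunables. */
struct sketch_tunables {
	int total_tries;              /* normal (non-recursive) try count */
	int chooseleaf_descend_once;  /* 0 = legacy behavior */
};

/* indep mode: the rule value wins, otherwise a single recursive try. */
static int sketch_recurse_tries_indep(int rule_tries)
{
	return rule_tries ? rule_tries : 1;
}

/* firstn mode: same, except legacy tunables keep the old try count. */
static int sketch_recurse_tries_firstn(const struct sketch_tunables *t,
				       int rule_tries)
{
	if (rule_tries)
		return rule_tries;
	if (t->chooseleaf_descend_once)
		return 1;
	return t->total_tries;  /* legacy: normal number of tries */
}

/* Usage: compare legacy and non-legacy tunables, with and without a rule step. */
static void sketch_recurse_tries_demo(void)
{
	struct sketch_tunables legacy = { .total_tries = 50, .chooseleaf_descend_once = 0 };
	struct sketch_tunables modern = { .total_tries = 50, .chooseleaf_descend_once = 1 };

	assert(sketch_recurse_tries_firstn(&legacy, 0) == 50);
	assert(sketch_recurse_tries_firstn(&modern, 0) == 1);
	assert(sketch_recurse_tries_firstn(&legacy, 5) == 5);
	assert(sketch_recurse_tries_indep(0) == 1);
	assert(sketch_recurse_tries_indep(3) == 3);
}
```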

crush: CHOOSE_LEAF -> CHOOSELEAF throughout · 917edad5

Ilya Dryomov authored Dec 24, 2013

This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.

Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

917edad5

crush: add SET_CHOOSE_TRIES rule step · cc10df4a

Ilya Dryomov authored Dec 24, 2013

Since we can specify the recursive retries in a rule, we may as well
specify the non-recursive tries too for completeness.

Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

cc10df4a

crush: apply chooseleaf_tries to firstn mode too · f18650ac

Ilya Dryomov authored Dec 24, 2013

Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well. Note that we have
slightly different behavior here than with indep:

If the tries value is not specified for firstn, we pass through the
normal attempt count. This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable. However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision. Sigh.

In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so. The descend_once tunable has no effect for indep.

Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

f18650ac

crush: new SET_CHOOSE_LEAF_TRIES command · be3226ac

Ilya Dryomov authored Dec 24, 2013

Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).

(We should do the same for the other tunables, by the way!)

Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

be3226ac