Commits · 86f1742b94dd0b4a2eb9255205d1756ddea182f8 · nexedi / linux

05 Apr, 2014 26 commits

libceph: fixup error handling in osdmap_apply_incremental() · 86f1742b

Ilya Dryomov authored Mar 13, 2014

The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro. This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset. Follow osdmap_decode() and fix this by adding
a special e_inval label to be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

86f1742b

libceph: fix crush_decode() call site in osdmap_decode() · 9902e682

Ilya Dryomov authored Mar 13, 2014

The size of the memory area feeded to crush_decode() should be limited
not only by osdmap end, but also by the crush map length.  Also, drop
unnecessary dout() (dout() in crush_decode() conveys the same info) and
step past crush map only if it is decoded successfully.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

9902e682

libceph: check length of osdmap osd arrays · 2d88b2e0

Ilya Dryomov authored Mar 13, 2014

Check length of osd_state, osd_weight and osd_addr arrays.  They
should all have exactly max_osd elements after the call to
osdmap_set_max_osd().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

2d88b2e0

libceph: safely decode max_osd value in osdmap_decode() · 3977058c

Ilya Dryomov authored Mar 13, 2014

max_osd value is not covered by any ceph_decode_need().  Use a safe
version of ceph_decode_* macro to decode it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

3977058c

libceph: fixup error handling in osdmap_decode() · 597b52f6

Ilya Dryomov authored Mar 13, 2014

The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro.  This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset.  Fix this by adding a special e_inval label to
be used by all ceph_decode_* macros.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

597b52f6

libceph: split osdmap allocation and decode steps · a2505d63

Ilya Dryomov authored Mar 13, 2014

Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

a2505d63

libceph: dump osdmap and enhance output on decode errors · 38a8d560

Ilya Dryomov authored Mar 13, 2014

Dump osdmap in hex on both full and incremental decode errors, to make
it easier to match the contents with error offset.  dout() map epoch
and max_osd value on success.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

38a8d560

libceph: dump pg_temp mappings to debugfs · 1c00240e

Ilya Dryomov authored Mar 13, 2014

Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g:

    pg_temp 2.6 [2,3,4]
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

1c00240e

libceph: do not prefix osd lines with \t in debugfs output · 0a2800d7

Ilya Dryomov authored Mar 13, 2014

To save screen space in anticipation of more fields (e.g. primary
affinity).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

0a2800d7

libceph: refer to osdmap directly in osdmap_show() · 35fea3a1

Ilya Dryomov authored Mar 13, 2014

To make it more readable and save screen space.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

35fea3a1

crush: support chooseleaf_vary_r tunable (tunables3) by default · 07bd7de4

Ilya Dryomov authored Mar 19, 2014

Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features
supported by default.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

07bd7de4

crush: add SET_CHOOSELEAF_VARY_R step · d83ed858

Ilya Dryomov authored Mar 19, 2014

This lets you adjust the vary_r tunable on a per-rule basis.

Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

d83ed858

crush: add chooseleaf_vary_r tunable · e2b149cc

Ilya Dryomov authored Mar 19, 2014

The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call.  That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.

Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.

Note that this was done from the get-go for the new crush_choose_indep
algorithm.

This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.

Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

e2b149cc

crush: allow crush rules to set (re)tries counts to 0 · 6ed1002f

Ilya Dryomov authored Mar 19, 2014

These two fields are misnomers; they are *retry* counts.

Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

6ed1002f

crush: fix off-by-one errors in total_tries refactor · 48a163db

Ilya Dryomov authored Mar 19, 2014

Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis. That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.

Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem. Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.

This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does. Inspection of the
crush debug output now aligns with prior versions, though.

Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

48a163db

ceph: don't include ceph.{file,dir}.layout vxattr in listxattr() · cc48c3e8

Yan, Zheng authored Mar 24, 2014

This avoids 'cp -a' modifying layout of new files/directories.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

cc48c3e8

ceph: check buffer size in ceph_vxattrcb_layout() · 1e5c6649

Yan, Zheng authored Mar 24, 2014

If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

1e5c6649

ceph: fix null pointer dereference in discard_cap_releases() · 00bd8edb

Yan, Zheng authored Mar 24, 2014

send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

00bd8edb

libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance() · d90deda6

Yan, Zheng authored Mar 23, 2014

When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

d90deda6

ceph: Remove get/set acl on symlinks · 5f75ce57

Fabian Frederick authored Mar 21, 2014

Remove unsupported symlink operations.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

5f75ce57

ceph: set mds_wanted when MDS reply changes a cap to auth cap · d9ffc4f7

Yan, Zheng authored Mar 18, 2014

When adjusting caps client wants, MDS does not record caps that are
not allowed. For non-auth MDS, it does not record WR caps. So when
a MDS reply changes a non-auth cap to auth cap, client needs to set
cap's mds_wanted according to the reply.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

d9ffc4f7

ceph: use fl->fl_file as owner identifier of flock and posix lock · eb13e832

Yan, Zheng authored Mar 09, 2014

flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).

The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

eb13e832

ceph: forbid mandatory file lock · eb70c0ce
Yan, Zheng authored Mar 04, 2014
```
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
```
eb70c0ce

ceph: use fl->fl_type to decide flock operation · 0e8e95d6

Yan, Zheng authored Mar 04, 2014

VFS does not directly pass flock's operation code to filesystem's
flock callback. It translates the operation code to the form how
posix lock's parameters are presented.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

0e8e95d6

ceph: update i_max_size even if inode version does not change · 8c93cd61

Yan, Zheng authored Mar 08, 2014

handle following sequence of events:
 - client releases a inode with i_max_size > 0. The release message
   is queued. (is not sent to the auth MDS)
 - a 'lookup' request reply from non-auth MDS returns the same inode.
 - client opens the inode in write mode. The version of inode trace
   in 'open' request reply is equal to the cached inode's version.
 - client requests new max size. The MDS ignores the request because
   it does not affect client's write range
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

8c93cd61

ceph: make sure write caps are registered with auth MDS · a2550604

Yan, Zheng authored Mar 08, 2014

Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

a2550604

03 Apr, 2014 14 commits

ceph: print inode number for LOOKUPINO request · c137a32a

Yan, Zheng authored Mar 01, 2014

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

c137a32a

ceph: add get_name() NFS export callback · 19913b4e

Yan, Zheng authored Mar 06, 2014

Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

19913b4e

ceph: fix ceph_fh_to_parent() · 8996f4f2

Yan, Zheng authored Mar 01, 2014

ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field
of struct ceph_nfs_confh. This is wrong, it should return dentry that
corresponds to the 'parent_ino' field.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

8996f4f2

ceph: add get_parent() NFS export callback · 9017c2ec

Yan, Zheng authored Mar 01, 2014

The callback uses LOOKUPPARENT MDS request to find parent.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

9017c2ec

ceph: simplify ceph_fh_to_dentry() · 4f32b42d

Yan, Zheng authored Mar 01, 2014

MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>

4f32b42d

ceph: fscache: Wait for completion of object initialization · f1fc4fee

Yunchuan Wen authored Dec 26, 2013

The object store limit needs to be updated after writing,
and this can be done provided the corresponding object has already
been initialized. Current object initialization is done asynchrously,
which introduce a race if a file is opened, then immediately followed
by a writing, the initialization may have not completed, the code will
reach the ASSERT in fscache_submit_exclusive_op() to cause kernel
bug.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>

f1fc4fee

ceph: fscache: Update object store limit after file writing · 32d3e148

Yunchuan Wen authored Dec 26, 2013

Synchronize object->store_limit[_l] with new inode->i_size after file writing.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>

32d3e148

ceph: fscache: add an interface to synchronize object store limit · 020c4bdd

Yunchuan Wen authored Dec 26, 2013

Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>

020c4bdd

ceph: do not set r_old_dentry_dir on link() · 4b58c9b1

Sage Weil authored Feb 05, 2013

This is racy--we do not know whather d_parent has changed out from
underneath us because i_mutex is not held on the source inode's directory.

Also, taking this reference is useless.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

4b58c9b1

ceph: do not assume r_old_dentry[_dir] always set together · 844d87c3

Sage Weil authored Feb 05, 2013

Do not assume that r_old_dentry implies that r_old_dentry_dir is also
true.  Separate out the ref cleanup and make the debugs dump behave when
it is NULL.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

844d87c3

ceph: do not chain inode updates to parent fsync · 752c8bdc

Sage Weil authored Feb 05, 2013

The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.
Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

752c8bdc

ceph: avoid useless ceph_get_dentry_parent_inode() in ceph_rename() · 180061a5

Sage Weil authored Feb 05, 2013

This is just old_dir; no reason to abuse the dcache pointers.

Reported-by: Al Viro <viro.zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

180061a5

ceph: let MDS adjust readdir 'frag' · 15289dc8

Yan, Zheng authored Mar 03, 2014

If readdir 'frag' is adjusted, readdir 'offset' should be reset.
Otherwise some dentries may be lost when readdir and fragmenting
directory happen at the some.

Another way to fix this issue is let MDS adjust readdir 'frag'.
The code that handles MDS reply reset the readdir 'offset' if
the readdir reply is different than the requested one.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

15289dc8

ceph: fix reset_readdir() · dcd3cc05

Yan, Zheng authored Feb 28, 2014

When changing readdir postion, fi->next_offset should be set to 0
if the new postion is not in the first dirfrag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>

dcd3cc05