- 05 Apr, 2014 15 commits
-
-
Ilya Dryomov authored
This lets you adjust the vary_r tunable on a per-rule basis.

Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
The current crush_choose_firstn code will re-use the same 'r' value for the recursive call. That means that if we are hitting a collision or rejection for some reason (say, an OSD that is marked out) and need to retry, we will keep making the same (bad) choice in that recursive selection.

Introduce a tunable that fixes that behavior by incorporating the parent 'r' value into the recursive starting point, so that a different path will be taken in subsequent placement attempts. Note that this was done from the get-go for the new crush_choose_indep algorithm.

This was exposed by a user who was seeing PGs stuck in active+remapped after reweight-by-utilization because the up set mapped to a single OSD.

Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
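A minimal sketch of the idea in C (illustrative only; the variable names and the exact shift approximate the later per-rule form of the tunable, not necessarily this commit):

    /* Without vary_r, the recursive descent always restarts from r' = 0,
     * so every retry after a collision or rejection walks the same (bad)
     * path.  With vary_r, the parent's current r is folded into the
     * recursive starting point, so each retry explores a different branch. */
    int sub_r;

    if (vary_r)
            sub_r = r >> (vary_r - 1);      /* incorporate parent's r */
    else
            sub_r = 0;                      /* legacy behavior */

    /* sub_r is then passed down as the starting r of the recursive
     * crush_choose_firstn() call that selects the leaf */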
-
Ilya Dryomov authored
These two fields are misnomers; they are *retry* counts.

Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH code to allow adjustment of the retry counts on a per-pool basis. That commit had an off-by-one bug: the previous "tries" counter was a *retry* count, not a *try* count, but the new code was passing in 1 meaning there should be no retries.

Fix the ftotal vs tries comparison to use < instead of <= to fix the problem. Note that the original code used <= here, which means the global "choose_total_tries" tunable is actually counting retries. Compensate for that by adding 1 in crush_do_rule when we pull the tunable into the local variable.

This was noticed looking at output from a user provided osdmap. Unfortunately the map doesn't illustrate the change in mapping behavior and I haven't managed to construct one yet that does. Inspection of the crush debug output now aligns with prior versions, though.

Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
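A sketch of the two-part fix described above (abridged; names approximate crush/mapper.c):

    /* In crush_do_rule(): the global tunable historically counted
     * *retries*, so add 1 when treating it as a *try* count. */
    choose_tries = map->choose_total_tries + 1;

    /* In the selection loop: compare with <, not <=, so that the
     * count really bounds the total number of attempts. */
    do {
            /* ... attempt a placement, bump ftotal on failure ... */
    } while (retry && ftotal < choose_tries);   /* was: ftotal <= tries */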
-
Yan, Zheng authored
This avoids 'cp -a' modifying layout of new files/directories.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Yan, Zheng authored
If buffer size is zero, return the size of layout vxattr. If buffer size is not zero, check if it is large enough for layout vxattr.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
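This matches the standard getxattr(2) size-probe convention. A user-space sketch of how a caller would use it (the path is hypothetical, error handling abbreviated):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/xattr.h>

    int main(void)
    {
            const char *path = "/mnt/ceph/somefile";  /* hypothetical path */

            /* Pass a zero-size buffer to learn the required size ... */
            ssize_t len = getxattr(path, "ceph.file.layout", NULL, 0);
            if (len < 0)
                    return 1;

            /* ... then fetch the value with a buffer that is large enough. */
            char *buf = malloc(len + 1);
            len = getxattr(path, "ceph.file.layout", buf, len);
            if (len >= 0)
                    printf("%.*s\n", (int)len, buf);
            free(buf);
            return 0;
    }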
-
Yan, Zheng authored
send_mds_reconnect() may call discard_cap_releases() after all release messages have been dropped by cleanup_cap_releases().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Yan, Zheng authored
When there is no more data, ceph_msg_data_{pages,pagelist}_advance() should not move on to the next page.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Fabian Frederick authored
Remove unsupported symlink operations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
-
Yan, Zheng authored
When adjusting the caps a client wants, the MDS does not record caps that are not allowed. For a non-auth MDS, it does not record WR caps. So when an MDS reply changes a non-auth cap to an auth cap, the client needs to set the cap's mds_wanted according to the reply.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Yan, Zheng authored
flock and posix locks should use fl->fl_file instead of the process ID as the owner identifier. (A posix lock uses fl->fl_owner; fl->fl_owner is usually equal to fl->fl_file, but it can also be a customized value.) The process ID of who holds the lock is just for the F_GETLK fcntl(2).

The fix is to rename the 'pid' fields of struct ceph_mds_request_args and struct ceph_filelock to 'owner', and to rename the 'pid_namespace' fields to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages. We also set the most significant bit of the 'owner' field; the MDS can use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not take the 'pid_namespace' into consideration when checking for conflicting locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
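A minimal sketch of the resulting owner encoding (illustrative only; the cast chain and bit constant are assumptions based on the description above, not the exact patch):

    /* The owner is derived from the struct file pointer, not a pid.
     * Setting the most significant bit lets the MDS tell new clients
     * from old ones, which still send a pid in this field. */
    __le64 owner = cpu_to_le64((u64)(unsigned long)fl->fl_file |
                               (1ULL << 63));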
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Yan, Zheng authored
The VFS does not directly pass flock's operation code to the filesystem's flock callback. It translates the operation code into the form in which posix lock parameters are presented.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
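Concretely, by the time the flock callback runs, fl->fl_type already holds posix-style codes. A sketch of the translation the filesystem then performs (assuming the CEPH_LOCK_* command names):

    if (fl->fl_type == F_RDLCK)             /* caller passed LOCK_SH */
            lock_cmd = CEPH_LOCK_SHARED;
    else if (fl->fl_type == F_WRLCK)        /* caller passed LOCK_EX */
            lock_cmd = CEPH_LOCK_EXCL;
    else                                    /* caller passed LOCK_UN */
            lock_cmd = CEPH_LOCK_UNLOCK;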
-
Yan, Zheng authored
Handle the following sequence of events:

- client releases an inode with i_max_size > 0. The release message is queued, but not sent to the auth MDS.
- a 'lookup' request reply from a non-auth MDS returns the same inode.
- client opens the inode in write mode. The version of the inode trace in the 'open' request reply is equal to the cached inode's version.
- client requests a new max size. The MDS ignores the request because it does not affect the client's write range.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Yan, Zheng authored
Only auth MDS can issue write caps to clients, so don't consider write caps registered with non-auth MDS as valid.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
-
- 03 Apr, 2014 23 commits
-
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Yan, Zheng authored
Use the newly introduced LOOKUPNAME MDS request to connect child inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Yan, Zheng authored
ceph_fh_to_parent() returns the dentry that corresponds to the 'ino' field of struct ceph_nfs_confh. This is wrong; it should return the dentry that corresponds to the 'parent_ino' field.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
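For reference, the connectable file handle carries both inos; a sketch of the struct as it appears in fs/ceph/export.c:

    /* connectable file handle: identifies both child and parent */
    struct ceph_nfs_confh {
            u64 ino, parent_ino;
    } __attribute__ ((packed));

fh_to_dentry resolves 'ino'; fh_to_parent must resolve 'parent_ino'.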
-
Yan, Zheng authored
The callback uses the LOOKUPPARENT MDS request to find the parent.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Yan, Zheng authored
MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way. So __cfh_to_dentry() is redundant.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-
Yunchuan Wen authored
The object store limit needs to be updated after writing, and this can be done provided the corresponding object has already been initialized. Currently, object initialization is done asynchronously, which introduces a race: if a file is opened and then immediately written to, the initialization may not have completed, and the code will reach the ASSERT in fscache_submit_exclusive_op(), causing a kernel bug.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
-
Yunchuan Wen authored
Synchronize object->store_limit[_l] with new inode->i_size after file writing.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
-
Yunchuan Wen authored
Add an interface to explicitly synchronize object->store_limit[_l] with inode->i_size.

Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
-
Sage Weil authored
This is racy: we do not know whether d_parent has changed out from underneath us, because i_mutex is not held on the source inode's directory. Also, taking this reference is useless.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Sage Weil authored
Do not assume that r_old_dentry implies that r_old_dentry_dir is also true. Separate out the ref cleanup and make the debugfs dump behave when it is NULL.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Sage Weil authored
The fsync(dirfd) only covers namespace operations, not inode updates. We do not need to cover setattr variants or O_TRUNC.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Sage Weil authored
This is just old_dir; no reason to abuse the dcache pointers.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
-
Yan, Zheng authored
If the readdir 'frag' is adjusted, the readdir 'offset' should be reset. Otherwise some dentries may be lost when a readdir and a directory fragmentation happen at the same time.

Another way to fix this issue is to let the MDS adjust the readdir 'frag'. The code that handles the MDS reply resets the readdir 'offset' if the frag in the reply is different from the requested one.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
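A hedged sketch of the reply-side reset (illustrative; variable names approximate fs/ceph/dir.c):

    /* If the MDS served a different frag than we asked for, any
     * cached offset is meaningless inside the new frag: start over. */
    if (le32_to_cpu(rinfo->dir_dir->frag) != frag) {
            frag = le32_to_cpu(rinfo->dir_dir->frag);
            off = 0;                        /* reset readdir offset */
    }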
-
Yan, Zheng authored
When changing the readdir position, fi->next_offset should be set to 0 if the new position is not in the first dirfrag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
-
Yan, Zheng authored
Comparing the offset with inode->i_sb->s_maxbytes doesn't make sense for a directory. For a fragmented directory, the offset (frag_t, off) can be larger than inode->i_sb->s_maxbytes.

At the very beginning of ceph_dir_llseek(), the local variable old_offset is initialized to the offset parameter. This doesn't make sense either; old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>
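For context, a ceph directory position packs the frag into the high 32 bits (a sketch of the helper as defined in fs/ceph/super.h at the time), which is exactly why it can exceed s_maxbytes:

    static inline loff_t ceph_make_fpos(unsigned frag, unsigned off)
    {
            return ((loff_t)frag << 32) | (loff_t)off;
    }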
-
Ilya Dryomov authored
In an effort to reduce fragmentation, prefix every rbd write with a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set to the object size (1 << order). Backwards compatibility is taken care of on the libceph/osd side.

"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to do it once. The reason every rbd write is prefixed is that rbd doesn't explicitly create objects and relies on writes creating them implicitly, so there is no place to stick a single hint op into. To get around that we decided to prefix every rbd write with a hint (just like write and setattr ops, hint op will create an object implicitly if it doesn't exist)."

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
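A sketch of the resulting two-op request for a plain rbd write (illustrative; the helper names follow the osd_client API added in this series, and obj_order/offset/length are assumed locals):

    /* op 0: advisory hint sized to the rbd object (1 << order) */
    osd_req_op_alloc_hint_init(osd_req, 0,
                               1ULL << obj_order,  /* expected_object_size */
                               1ULL << obj_order); /* expected_write_size */
    /* op 1: the actual write */
    osd_req_op_extent_init(osd_req, 1, CEPH_OSD_OP_WRITE,
                           offset, length, 0, 0);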
-
Ilya Dryomov authored
In preparation for prefixing rbd writes with an allocation hint introduce a num_ops parameter for rbd_osd_req_create(). The rationale is that not every write request is a write op that needs to be prefixed (e.g. watch op), so the num_ops logic needs to be in the callers.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Our longest osd request now contains 3 ops: copyup+hint+write. Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was hard-coded to 2. Fix it, and switch to rbd_assert while at it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
This is primarily for rbd's benefit and is supposed to combat fragmentation:

"... knowing that rbd images have a 4m size, librbd can pass a hint that will let the osd do the xfs allocation size ioctl on new files so that they are allocated in 1m or 4m chunks. We've seen cases where users with rbd workloads have very high levels of fragmentation in xfs and this would mitigate that and probably have a pretty nice performance benefit."

SETALLOCHINT is considered advisory, so our backwards compatibility mechanism here is to set FAILOK flag for all SETALLOCHINT ops.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
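A minimal sketch of the compatibility mechanism (flag name as in include/linux/ceph/rados.h):

    /* The hint is advisory: mark it FAILOK so an OSD that does not
     * understand CEPH_OSD_OP_SETALLOCHINT fails only this op rather
     * than the whole request. */
    op->flags |= CEPH_OSD_OP_FLAG_FAILOK;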
-
Ilya Dryomov authored
Encode ceph_osd_op::flags field so that it gets sent over the wire.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is not only insufficient, but also triggers an rbd_assert() in rbd_obj_request_destroy():

    Assertion failure in rbd_obj_request_destroy() at line 1867:
    rbd_assert(obj_request->img_request == NULL);

rbd_img_obj_request_add() adds obj_requests to the img_request, the opposite is rbd_img_obj_request_del(). Use it.

Fixes: http://tracker.ceph.com/issues/7327

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Commit 03507db6 ("rbd: fix buffer size for writes to images with snapshots") moved the call to rbd_img_obj_request_add() up, making the out_partial label bogus. Remove it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
With the addition of erasure coding support in the future, scratch variable-length array in crush_do_rule_ary() is going to grow to at least 200 bytes on average, on top of another 128 bytes consumed by rawosd/osd arrays in the call chain. Replace it with a buffer inside struct osdmap and a mutex.

This shouldn't result in any contention, because all osd requests were already serialized by request_mutex at that point; the only unlocked caller was ceph_ioctl_get_dataloc().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
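A sketch of the replacement (field names approximate the commit):

    struct ceph_osdmap {
            /* ... */
            struct mutex crush_scratch_mutex;
            int crush_scratch_ary[CEPH_PG_MAX_SIZE * 3];
    };

crush_do_rule_ary() then takes crush_scratch_mutex around the crush_do_rule() call instead of putting the scratch array on the stack.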
-
- 31 Mar, 2014 2 commits
-
-
Linus Torvalds authored
-
Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds authored
Pull vfs fixes from Al Viro:
 "Switch mnt_hash to hlist, turning the races between __lookup_mnt() and hash modifications into false negatives from __lookup_mnt() (instead of hangs)"

On the false negatives from __lookup_mnt():
 "The *only* thing we care about is not getting stuck in __lookup_mnt(). If it misses an entry because something in front of it just got moved around, etc, we are fine. We'll notice that mount_lock mismatch and that'll be it"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch mnt_hash to hlist
  don't bother with propagate_mnt() unless the target is shared
  keep shadowed vfsmounts together
  resizable namespace.c hashes
-