Commits · 27263e28321db438bc43dc0c0be432ce91526224 · nexedi / linux

16 Jan, 2012 23 commits

Merge branch 'restriper' of git://github.com/idryomov/btrfs-unstable into integration · 27263e28
Chris Mason authored Jan 16, 2012

27263e28
Merge branch 'allocation-fixes' into integration · 64e05503
Chris Mason authored Jan 16, 2012

64e05503
Btrfs: add balance progress reporting · 19a39dce
Ilya Dryomov authored Jan 16, 2012
```
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
```
19a39dce

Btrfs: allow for resuming restriper after it was paused · de322263

Ilya Dryomov authored Jan 16, 2012

Recognize BTRFS_BALANCE_RESUME flag passed from userspace.  We use the
same heuristics used when recovering balance after a crash to try to
start where we left off last time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

de322263

Btrfs: allow for canceling restriper · a7e99c69

Ilya Dryomov authored Jan 16, 2012

Implement an ioctl for canceling restriper.  Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit.  Balance item is deleted and no memory
about the interrupted balance is kept.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a7e99c69

Btrfs: allow for pausing restriper · 837d5b6e

Ilya Dryomov authored Jan 16, 2012

Implement an ioctl for pausing restriper.  This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc.  If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.

Add a hook to close_ctree() to pause restriper and free its data
structures on unmount.  (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

837d5b6e

Btrfs: add skip_balance mount option · 9555c6c1

Ilya Dryomov authored Jan 16, 2012

Since restriper kthread starts involuntarily on mount and can suck cpu
and memory bandwidth add a mount option to forcefully skip it.  The
restriper in that case hangs around in paused state and can be resumed
from userspace when it's convenient.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

9555c6c1

Btrfs: recover balance on mount · 59641015

Ilya Dryomov authored Jan 16, 2012

On mount, if balance item is found, resume balance in a separate
kernel thread.

Try to be smart to continue roughly where previous balance (or convert)
was interrupted.  For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

59641015

Btrfs: save balance parameters to disk · 0940ebf6

Ilya Dryomov authored Jan 16, 2012

Introduce a new btree objectid for storing balance item.  The reason is
to be able to resume restriper after a crash with the same parameters.
Balance item has a very high objectid and goes into tree of tree roots.

The key for the new item is as follows:

	[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]

Older kernels simply ignore it so it's safe to mount with an older
kernel and then go back to the newer one.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

0940ebf6

Btrfs: soft profile changing mode (aka soft convert) · cfa4c961

Ilya Dryomov authored Jan 16, 2012

When doing convert from one profile to another if soft mode is on
restriper won't touch chunks that already have the profile we are
converting to.  This is useful if e.g. half of the FS was converted
earlier.

The soft mode switch is (like every other filter) per-type.  This means
that we can convert for example meta chunks the "hard" way while
converting data chunks selectively with soft switch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

cfa4c961

Btrfs: implement online profile changing · e4d8ec0f

Ilya Dryomov authored Jan 16, 2012

Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized.  Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce.  If target profile is not yet available it goes
back to a plain reduce.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

e4d8ec0f

Btrfs: do not reduce profile in do_chunk_alloc() · 70922617

Ilya Dryomov authored Jan 16, 2012

Every caller of do_chunk_alloc() feeds it the reduced allocation
profile, so stop trying to reduce it one more time.  Instead check the
validity of the passed profile.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

70922617

Btrfs: virtual address space subset filter · ea67176a

Ilya Dryomov authored Jan 16, 2012

Select chunks which have at least one byte located inside a given
[vstart, vend) virtual address space range.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ea67176a

Btrfs: devid subset filter · 94e60d5a

Ilya Dryomov authored Jan 16, 2012

Select chunks which have at least one byte of at least one stripe
located on a device with devid X in a given [pstart,pend) physical
address range.

This filter only works when devid filter is turned on.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

94e60d5a

Btrfs: devid filter · 409d404b

Ilya Dryomov authored Jan 16, 2012

Relocate chunks which have at least one stripe located on a device with
devid X.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

409d404b

Btrfs: usage filter · 5ce5b3c0

Ilya Dryomov authored Jan 16, 2012

Select chunks that are less than X percent full.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

5ce5b3c0

Btrfs: profiles filter · ed25e9b2

Ilya Dryomov authored Jan 16, 2012

Select chunks based on a given profile mask.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ed25e9b2

Btrfs: add basic infrastructure for selective balancing · f43ffb60

Ilya Dryomov authored Jan 16, 2012

This allows to have a separate set of filters for each chunk type
(data,meta,sys).  The code however is generic and switch on chunk type
is only done once.

This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

f43ffb60

Btrfs: add basic restriper infrastructure · c9e9f97b

Ilya Dryomov authored Jan 16, 2012

Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc.  The semantics of the old balancing
ioctl are fully preserved.

Explicitly disallow any volume operations when balance is in progress.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

c9e9f97b

Btrfs: make avail_*_alloc_bits fields dynamic · 10ea00f5

Ilya Dryomov authored Jan 16, 2012

Currently when new chunks are created respective avail_alloc_bits field
is updated to reflect profiles of all chunks present in the system.
However when chunks are removed profile bits are never cleared.

This patch clears profile bit of respective avail_alloc_bits field when
the last chunk with that profile is removed.  Restriper needs this to
properly operate when "downgrading".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

10ea00f5

Btrfs: add BTRFS_AVAIL_ALLOC_BIT_SINGLE bit · a46d11a8

Ilya Dryomov authored Jan 16, 2012

Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for
avail_{data,metadata,system}_alloc_bits fields, which gather info about
available allocation profiles in the FS. When chunk is created or read
from disk, its profile is OR'ed with the corresponding avail_alloc_bits
field. Since SINGLE is denoted by 0 in the on-disk format, currently
there is no way to tell when such chunks become avaialble. Restriper
needs that information, so add a separate bit for SINGLE profile.

This bit is going to be in-memory only, it should never be written out
to disk, so it's not a disk format change. However to avoid remappings
in future, reserve corresponding on-disk bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a46d11a8

Btrfs: introduce masks for chunk type and profile · 52ba6929

Ilya Dryomov authored Jan 16, 2012

Chunk's type and profile are encoded in u64 flags field.  Introduce
masks to easily access them.  Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

52ba6929

Btrfs: get rid of *_alloc_profile fields · 6fef8df1

Ilya Dryomov authored Jan 16, 2012

{data,metadata,system}_alloc_profile fields have been unused for a long
time now.  Get rid of them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

6fef8df1

08 Jan, 2012 2 commits

Btrfs: revamp clustered allocation logic · 1bb91902

Alexandre Oliva authored Oct 14, 2011

Parameterize clusters on minimum total size, minimum chunk size and
minimum contiguous size for at least one chunk, without limits on
cluster, window or gap sizes. Don't tolerate any fragmentation for
SSD_SPREAD; accept it for metadata, but try to keep data dense.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

1bb91902

Btrfs: don't set up allocation result twice · fc7c1077

Alexandre Oliva authored Nov 28, 2011

We store the allocation start and length twice in ins, once right
after the other, but with intervening calls that may prevent the
duplicate from being optimized out by the compiler.  Remove one of the
assignments.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

fc7c1077

06 Jan, 2012 4 commits

Btrfs: test free space only for unclustered allocation · a5f6f719

Alexandre Oliva authored Dec 12, 2011

Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.

I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a5f6f719

Btrfs: use bigger metadata chunks on bigger filesystems · 1100373f

Chris Mason authored Jan 06, 2012

The 256MB chunk is a little small on a huge FS.  This scales up the
chunk size.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

1100373f

Btrfs: lower the bar for chunk allocation · cf1d72c9

Chris Mason authored Jan 06, 2012

The chunk allocation code has tried to keep a pretty tight lid on creating new
metadata chunks.  This is partially because in the past the reservation
code didn't give us an accurate idea of how much space was being used.

The new code is much more accurate, so we're able to get rid of some of these
checks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

cf1d72c9

Btrfs: run chunk allocations while we do delayed refs · 203bf287

Chris Mason authored Jan 06, 2012

Btrfs tries to batch extent allocation tree changes to improve performance
and reduce metadata trashing. But it doesn't allocate new metadata chunks
while it is doing allocations for the extent allocation tree.

This commit changes the delayed refence code to do chunk allocations if we're
getting low on room. It prevents crashes and improves performance.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

203bf287

04 Jan, 2012 11 commits

Linux 3.2 · 805a6af8
Linus Torvalds authored Jan 04, 2012

805a6af8

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 86968238

Linus Torvalds authored Jan 04, 2012

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  fix CAN MAINTAINERS SCM tree type
  mwifiex: fix crash during simultaneous scan and connect
  b43: fix regression in PIO case
  ath9k: Fix kernel panic in AR2427 in AP mode
  CAN MAINTAINERS update
  net: fsl: fec: fix build for mx23-only kernel
  sch_qfq: fix overflow in qfq_update_start()
  Revert "Bluetooth: Increase HCI reset timeout in hci_dev_do_close"

86968238

minixfs: misplaced checks lead to dentry leak · d6042eac

Al Viro authored Jan 04, 2012

bitmap size sanity checks should be done *before* allocating ->s_root;
there their cleanup on failure would be correct.  As it is, we do iput()
on root inode, but leak the root dentry...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d6042eac

ptrace: ensure JOBCTL_STOP_SIGMASK is not zero after detach · 8a88951b

Oleg Nesterov authored Jan 04, 2012

This is the temporary simple fix for 3.2, we need more changes in this
area.

1. do_signal_stop() assumes that the running untraced thread in the
   stopped thread group is not possible. This was our goal but it is
   not yet achieved: a stopped-but-resumed tracee can clone the running
   thread which can initiate another group-stop.

   Remove WARN_ON_ONCE(!current->ptrace).

2. A new thread always starts with ->jobctl = 0. If it is auto-attached
   and this group is stopped, __ptrace_unlink() sets JOBCTL_STOP_PENDING
   but JOBCTL_STOP_SIGMASK part is zero, this triggers WANR_ON(!signr)
   in do_jobctl_trap() if another debugger attaches.

   Change __ptrace_unlink() to set the artificial SIGSTOP for report.

   Alternatively we could change ptrace_init_task() to copy signr from
   current, but this means we can copy it for no reason and hide the
   possible similar problems.
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org>		[3.1]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8a88951b

ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race · 50b8d257

Oleg Nesterov authored Jan 04, 2012

Test-case:

	int main(void)
	{
		int pid, status;

		pid = fork();
		if (!pid) {
			for (;;) {
				if (!fork())
					return 0;
				if (waitpid(-1, &status, 0) < 0) {
					printf("ERR!! wait: %m\n");
					return 0;
				}
			}
		}

		assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
		assert(waitpid(-1, NULL, 0) == pid);

		assert(ptrace(PTRACE_SETOPTIONS, pid, 0,
					PTRACE_O_TRACEFORK) == 0);

		do {
			ptrace(PTRACE_CONT, pid, 0, 0);
			pid = waitpid(-1, NULL, 0);
		} while (pid > 0);

		return 1;
	}

It fails because ->real_parent sees its child in EXIT_DEAD state
while the tracer is going to change the state back to EXIT_ZOMBIE
in wait_task_zombie().

The offending commit is 823b018e which moved the EXIT_DEAD check,
but in fact we should not blame it. The original code was not
correct as well because it didn't take ptrace_reparented() into
account and because we can't really trust ->ptrace.

This patch adds the additional check to close this particular
race but it doesn't solve the whole problem. We simply can't
rely on ->ptrace in this case, it can be cleared if the tracer
is multithreaded by the exiting ->parent.

I think we should kill EXIT_DEAD altogether, we should always
remove the soon-to-be-reaped child from ->children or at least
we should never do the DEAD->ZOMBIE transition. But this is too
complex for 3.2.
Reported-and-tested-by: Denys Vlasenko <vda.linux@googlemail.com>
Tested-by: Lukasz Michalik <lmi@ift.uni.wroc.pl>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org>		[3.0+]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

50b8d257

Merge git://git.samba.org/sfrench/cifs-2.6 · 8d9cbf82

Linus Torvalds authored Jan 04, 2012

* git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] default ntlmv2 for cifs mount delayed to 3.3
  cifs: fix bad buffer length check in coalesce_t2

8d9cbf82

Merge branch 'master' of... · d8f46ff1

John W. Linville authored Jan 04, 2012

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem

d8f46ff1

Revert "rtc: Expire alarms after the time is set." · f423fc62

Linus Torvalds authored Jan 04, 2012

This reverts commit 93b2ec01.

The call to "schedule_work()" in rtc_initialize_alarm() happens too
early, and can cause oopses at bootup

Neil Brown explains why we do it:

  "If you set an alarm in the future, then shutdown and boot again after
   that time, then you will end up with a timer_queue node which is in
   the past.

   When this happens the queue gets stuck.  That entry-in-the-past won't
   get removed until and interrupt happens and an interrupt won't happen
   because the RTC only triggers an interrupt when the alarm is "now".

   So you'll find that e.g.  "hwclock" will always tell you that
   'select' timed out.

   So we force the interrupt work to happen at the start just in case."

and has a patch that convert it to do things in-process rather than with
the worker thread, but right now it's too late to play around with this,
so we just revert the patch that caused problems for now.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Requested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Requested-by: John Stultz <john.stultz@linaro.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f423fc62

[CIFS] default ntlmv2 for cifs mount delayed to 3.3 · 225de11e

Steve French authored Jan 03, 2012

Turned out the ntlmv2 (default security authentication)
upgrade was harder to test than expected, and we ran
out of time to test against Apple and a few other servers
that we wanted to.  Delay upgrade of default security
from ntlm to ntlmv2 (on mount) to 3.3.  Still works
fine to specify it explicitly via "sec=ntlmv2" so this
should be fine.
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>

225de11e

cifs: fix bad buffer length check in coalesce_t2 · 497728e1

Jeff Layton authored Jan 01, 2012

The current check looks to see if the RFC1002 length is larger than
CIFSMaxBufSize, and fails if it is. The buffer is actually larger than
that by MAX_CIFS_HDR_SIZE.

This bug has been around for a long time, but the fact that we used to
cap the clients MaxBufferSize at the same level as the server tended
to paper over it. Commit c974befa changed that however and caused this
bug to bite in more cases.
Reported-and-Tested-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>

497728e1

Revert "rtc: Disable the alarm in the hardware" · 157e8bf8

Linus Torvalds authored Jan 03, 2012

This reverts commit c0afabd3.

It causes failures on Toshiba laptops - instead of disabling the alarm,
it actually seems to enable it on the affected laptops, resulting in
(for example) the laptop powering on automatically five minutes after
shutdown.

There's a patch for it that appears to work for at least some people,
but it's too late to play around with this, so revert for now and try
again in the next merge window.

See for example

	http://bugs.debian.org/652869

Reported-and-bisected-by: Andreas Friedrich <afrie@gmx.net> (Toshiba Tecra)
Reported-by: Antonio-M. Corbi Bellot <antonio.corbi@ua.es> (Toshiba Portege R500)
Reported-by: Marco Santos <marco.santos@waynext.com> (Toshiba Portege Z830)
Reported-by: Christophe Vu-Brugier <cvubrugier@yahoo.fr>  (Toshiba Portege R830)
Cc: Jonathan Nieder <jrnieder@gmail.com>
Requested-by: John Stultz <john.stultz@linaro.org>
Cc: stable@kernel.org  # for the versions that applied this
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

157e8bf8