Commits · 6165572c1139dd694afb8e382a5f06e7e0fa4ad8 · nexedi / linux

19 Jun, 2017 40 commits

btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdev · 6165572c

David Sterba authored Jun 15, 2017

The function is called from ioctl context and we don't hold any locks
that take part in writeback. Right now it's only fs_info::volume_mutex.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use GFP_KERNEL in btrfs_calc_avail_data_space · 6a44517d

David Sterba authored Jun 15, 2017

We don't hold any locks here. Indirectly called from statfs.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Use btrfs_space_info_used instead of opencoding it · 0eee8a49

Nikolay Borisov authored Jun 14, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: wait part of the write_dev_flush() can be separated out · 4fc6441a

Anand Jain authored Jun 13, 2017

Submit and wait parts of write_dev_flush() can be split into two
separate functions for better readability.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant null bdev counting during flush submission · cea7c8bf

Anand Jain authored Jun 13, 2017

There is no extra benefit to counting null bdevs during the submit loop,
as these null devices will be checked anyway during the command
completion device loop just after the submit loop. We are holding the
device_list_mutex, so the device->bdev status won't change in between.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: write_dev_flush does not return ENOMEM anymore · 12b9bf0b

Anand Jain authored Jun 13, 2017

Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling"
write_dev_flush will not return ENOMEM in the sending part. We do not
need to check for it in the callers.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: compression must free at least one sector size · 170607eb

Timofey Titovets authored Jun 06, 2017

We already skip storing data where compression does not make the result
at least one byte less.  Let's make the logic better and check
that compression frees at least one sector size of bytes, otherwise it's
not that useful.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
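
The check described in this changelog can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the kernel code: the helper name `compression_frees_sector` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: keep the compressed result only if it frees at
 * least one full sector compared to the uncompressed input.
 */
static bool compression_frees_sector(uint64_t input_len,
				     uint64_t compressed_len,
				     uint32_t sectorsize)
{
	return compressed_len + sectorsize <= input_len;
}
```

With a 4096-byte sector, an 8192-byte input compressed to 8191 bytes saves one byte but still occupies two sectors, so this sketch rejects it; a result of 4096 bytes or less is accepted.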

btrfs: sink gfp parameter to btrfs_io_bio_alloc · c5e4c3d7

David Sterba authored Jun 12, 2017

We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more important,
contexts and the bio allocating API now looks more consistent.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add helper to initialize the non-bio part of btrfs_io_bio · 184f999e

David Sterba authored Jun 12, 2017

We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
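
A rough userspace sketch of the idea, assuming a container structure whose embedded bio is the last member. The names `model_bio`, `model_io_bio` and `model_io_bio_init` are hypothetical stand-ins, and the member list is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct bio; the real one is much larger. */
struct model_bio {
	uint64_t sector;
};

/*
 * Hypothetical stand-in for btrfs_io_bio: private members first, the
 * embedded bio last, so the private part spans bytes [0, offsetof(bio)).
 */
struct model_io_bio {
	unsigned int mirror_num;
	uint64_t logical;
	void *csum;
	struct model_bio bio;	/* initialized separately by the allocator */
};

/* Zero only the private part, leaving the embedded bio untouched. */
static void model_io_bio_init(struct model_io_bio *io_bio)
{
	memset(io_bio, 0, offsetof(struct model_io_bio, bio));
}
```

The sketch only works because `bio` is the last member: everything before its offset is private data that the bio allocator never touches.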

btrfs: document mandatory order of bio in btrfs_io_bio · fa1bcbe0

David Sterba authored Jun 12, 2017

Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: btrfs_ioctl_search_key documentation · 1a63143d

Hans van Kranenburg authored Jun 06, 2017

A programmer who is trying to implement calling the btrfs SEARCH
or SEARCH_V2 ioctl will probably soon end up reading this struct
definition.

Properly document the input fields to prevent common misconceptions:
 1. The search space is linear, not 3 dimensional. The individual min/max
 values for objectid, type and offset cannot be used to filter the
 result, they only define the endpoints of an interval.
 2. The transaction id (a.k.a. generation) filter applies only on
 transaction id of the last COW operation on a whole metadata page, not
 on individual items.

Ad 1. The first misunderstanding was helped by the previous misleading
comments on min/max type and offset:
  "keys returned will be >= min and <= max".

Ad 2. For example, running btrfs balance will happily cause rewriting of
metadata pages that contain a filesystem tree of a read only subvolume,
causing transids to be increased.

Also, improve descriptions of tree_id and nr_items and add in/out
annotations.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
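
Point 1 above can be illustrated with a small userspace model. It assumes keys are ordered as one (objectid, type, offset) tuple; `model_key`, `model_key_cmp` and `model_key_in_range` are hypothetical names, and the sketch covers only the key interval, not the transaction id filter or the ioctl itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified key, compared as one (objectid, type, offset) tuple. */
struct model_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Lexicographic comparison: objectid first, then type, then offset. */
static int model_key_cmp(const struct model_key *a, const struct model_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* True if key lies in the closed linear interval [min, max]. */
static bool model_key_in_range(const struct model_key *key,
			       const struct model_key *min,
			       const struct model_key *max)
{
	return model_key_cmp(min, key) <= 0 && model_key_cmp(key, max) <= 0;
}
```

With min = (256, 1, 0) and max = (260, 1, UINT64_MAX), a key (258, 108, 4096) is inside the interval even though its type is not 1, because the objectid alone already places it between the two endpoints.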

Btrfs: skip checksum verification if IO error occurs · ef7cdac1

Liu Bo authored Apr 13, 2017

Currently a dio read still goes on to verify the checksum when -EIO has been
returned. Although that verification usually fails, it's not necessary at all;
we could directly check if there is another copy to read.

And with this, the behavior of dio read is now consistent with that of
buffered read.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use bool for uptodate ]
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: tolerate errors if we have retried successfully · e3d37fab

Liu Bo authored May 17, 2017

With raid1 profile, dio read isn't tolerating IO errors if read length is
less than the stripe length (64K).

Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags &
BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read
length is less than 64k.  In this case, if the underlying device returns
error somehow, bio->bi_error has recorded that error.

If we could recover the correct data from another copy in profile raid1/10/5/6,
with btrfs_subio_endio_read() returning 0, bio would have the correct data in
its vector, but bio->bi_error is not updated accordingly so that the following
dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed.

This fixes the problem by setting bio's error to 0 if a good copy has been
found.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
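
The fix can be modeled in userspace roughly as below. This is a simplified sketch, not the btrfs code: `model_read_with_retry` is a hypothetical name, and each array entry stands for the result of reading one copy (0 on success, a negative errno on failure).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model: mirror_err[0] is the status recorded by the first
 * read, as bio->bi_error would be. If a later copy reads fine, the stale
 * error is cleared so the caller sees success.
 */
static int model_read_with_retry(const int *mirror_err, size_t num_mirrors)
{
	int bio_error;
	size_t i;

	if (num_mirrors == 0)
		return -EIO;

	bio_error = mirror_err[0];	/* status recorded by the first read */
	for (i = 1; bio_error != 0 && i < num_mirrors; i++) {
		if (mirror_err[i] == 0)
			bio_error = 0;	/* good copy found: clear the stale error */
	}
	return bio_error;
}
```

Before the fix, the equivalent of `bio_error` kept its first value even after a successful retry, which is why direct IO reported a failed read.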

btrfs: pass bytes to btrfs_bio_alloc · c821e7f3

David Sterba authored Jun 02, 2017

Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
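
A minimal sketch of moving the conversion into the helper, assuming 512-byte block layer sectors. `model_sector_bio` and `model_bio_alloc` are hypothetical names that only record the start sector.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9	/* assumption: 512-byte block layer sectors */

/* Hypothetical stand-in for a bio that only records its start sector. */
struct model_sector_bio {
	uint64_t start_sector;
};

/* The helper takes a byte offset and hides the bytes-to-sectors shift. */
static struct model_sector_bio model_bio_alloc(uint64_t first_byte)
{
	struct model_sector_bio bio = {
		.start_sector = first_byte >> MODEL_SECTOR_SHIFT,
	};

	return bio;
}
```

Callers then pass the byte offset they already have, for example `model_bio_alloc(1048576)`, instead of each repeating the shift.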

btrfs: opencode trivial compressed_bio_alloc, simplify error handling · 9886b174

David Sterba authored Jun 02, 2017

compressed_bio_alloc is now a trivial wrapper around btrfs_bio_alloc, no
point keeping it. The error handling can be simplified, as we know
btrfs_bio_alloc will never fail.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant parameters from btrfs_bio_alloc · 9f2179a5

David Sterba authored Jun 02, 2017

All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.

submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink gfp parameter to btrfs_bio_clone · 8b6c1d56

David Sterba authored Jun 02, 2017

All callers pass GFP_NOFS.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: btrfs_io_bio_alloc never fails, skip error handling · e4f56903

David Sterba authored Jun 02, 2017

Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: btrfs_bio_clone never fails, skip error handling · 3aa8e074

David Sterba authored Jun 02, 2017

Update direct callers of btrfs_bio_clone that do error handling, that we
can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: btrfs_bio_alloc never fails, skip error handling · 0c4dd97c

David Sterba authored Jun 02, 2017

Update direct callers of btrfs_bio_alloc that do error handling, that we
can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: bioset allocations will never fail, adapt our helpers · 6e707bcd

David Sterba authored Jun 02, 2017

Christoph pointed out that bio allocations backed by a bioset will never
fail.  As we always use a bioset for all bio allocations, we can skip
the error handling.  This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.

CC: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace · 6acafd1e

David Sterba authored May 31, 2017

The compression workspace buffers are larger than a page so we use
vmalloc, unconditionally. This is not always necessary as there might be
contiguous memory available.

Let's use the kvmalloc helpers that will try kmalloc first and fall back
to vmalloc. For that they require GFP_KERNEL flags. As we now have the
alloc_workspace calls protected by memalloc_nofs in the critical
contexts, we can safely use GFP_KERNEL.
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace · 389a6cfc

David Sterba authored May 31, 2017

As alloc_workspace is now protected by memalloc_nofs where needed,
we can switch the kmalloc to use GFP_KERNEL.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add memalloc_nofs protections around alloc_workspace callback · fe308533

David Sterba authored May 31, 2017

The workspaces are preallocated at the beginning where we can safely use
GFP_KERNEL, but in some cases the find_workspace might reach the
allocation again, now in a more restricted context when the bios or
pages are being compressed.

To avoid potential lockup when alloc_workspace -> vmalloc would silently
use the GFP_KERNEL, add the memalloc_nofs helpers around the critical
call site.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
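
The scoped protection can be pictured with a small userspace model. It is a sketch only: the `MODEL_GFP_*` bits and the `model_*` functions are hypothetical simplifications, not the kernel's GFP flags or the real memalloc_nofs helpers, and a plain global stands in for the per-task state.

```c
#include <assert.h>

#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS		MODEL_GFP_IO

/* Per-task "no FS reclaim" flag; a plain global in this single-threaded model. */
static unsigned int model_task_nofs;

/* Enter a NOFS scope and return the previous state so scopes can nest. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_nofs;

	model_task_nofs = 1;
	return old;
}

/* Leave the scope by restoring whatever state the matching save saw. */
static void model_nofs_restore(unsigned int old)
{
	model_task_nofs = old;
}

/* What an allocator would effectively use for a request made right now. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	return model_task_nofs ? (gfp & ~MODEL_GFP_FS) : gfp;
}
```

Inside a save/restore pair, a callee that asks for `MODEL_GFP_KERNEL` is quietly limited to `MODEL_GFP_NOFS`, so the callee does not need to know which context it runs in.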

btrfs: adjust includes after vmalloc removal · adf02123

David Sterba authored May 31, 2017

As we don't use vmalloc/vzalloc/vfree directly in ctree.c, we can now
use the proper header that defines kvmalloc.
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use GFP_KERNEL in init_ipath · f54de068

David Sterba authored May 31, 2017

Now that init_ipath is called either from a safe context or with
memalloc_nofs protection, we can switch to GFP_KERNEL allocations in
init_ipath and init_data_container.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: add memalloc_nofs protection around init_ipath · de2491fd

David Sterba authored May 31, 2017

init_ipath is called from a safe ioctl context and from scrub when
printing an error.  The protection is added for three reasons:

* init_data_container calls vmalloc and this does not work as expected
  in the GFP_NOFS context, so this silently does GFP_KERNEL and might
  deadlock in some cases
* keep the context constraint of GFP_NOFS, used by scrub
* we want to use GFP_KERNEL unconditionally inside init_ipath or its
  callees
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: send: use kvmalloc in iterate_dir_item · f11f7441

David Sterba authored May 31, 2017

We use a growing buffer for xattrs larger than a page size; at some
point vmalloc is unconditionally used for larger buffers. We can still
try to avoid it using the kvmalloc helper.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: replace opencoded kvzalloc with the helper · 818e010b

David Sterba authored May 31, 2017

The logic of kmalloc and vmalloc fallback is opencoded in
several places, we can now use the existing helper.
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: lzo: compressed data size must be less then input size · 1e9d7291

Timofey Titovets authored May 30, 2017

Logic already skips if compression makes data bigger, let's sync lzo
with zlib and also return error if compressed size is equal to
input size.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify code with bio_io_error · 054ec2f6

Guoqing Jiang authored Jun 02, 2017

bio_io_error was introduced in the commit 4246a0b6
("block: add a bi_error field to struct bio"), so use it to simplify
code.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: use memalloc_nofs and kvzalloc() for free space tree bitmaps · 25ff17e8

Omar Sandoval authored Jun 05, 2017

First, instead of open-coding the vmalloc() fallback, use the new
kvzalloc() helper. Second, use memalloc_nofs_{save,restore}() instead of
GFP_NOFS, as vmalloc() uses some GFP_KERNEL allocations internally which
could lead to deadlocks.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use generic slab for for btrfs_transaction · 4b5faeac

David Sterba authored Mar 28, 2017

Observing the number of slab objects of btrfs_transaction, there's just
one active on an almost quiescent filesystem, and the number of objects
goes to about ten when sync is in progress. Then the number goes down to
1.  This matches the expectations of the transaction lifetime.

For such use the separate slab cache is not justified, as we do not
reuse objects frequently. For the shortlived transaction, the generic
slab (size 512) should be ok. We can optimistically expect that the 512
slabs are not all used (fragmentation) and there are free slots to take
when we do the allocation, compared to potentially allocating a whole new
page for the separate slab.

We'll lose the stats about the object use, which could be added later if
we really need them.
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: embed scrub_wr_ctx into scrub context · 3fb99303

David Sterba authored May 16, 2017

The structure scrub_wr_ctx is not used anywhere but the scrub context, so
we can move the members there. The tgtdev is renamed so it's more clear
that it belongs to the "wr" part.
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: use fs_info::sectorsize and drop it from scrub context · 25cc1226

David Sterba authored May 16, 2017

As we now have the node/block sizes in fs_info, we can use them and can
drop the local copies.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: add statx support · 04a87e34

Yonghong Song authored May 12, 2017

Return enhanced file attributes from the btrfs, including:
  (1). inode creation time as stx_btime, and
  (2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags.

Example output:
	[root@localhost ~]# cat t.sh
	touch t
	chattr +aic t
	~/linux/samples/statx/test-statx t
	chattr -aic t
	touch t
	echo "========================================"
	~/linux/samples/statx/test-statx t
	/bin/rm t
	[root@localhost ~]# ./t.sh
	statx(t) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:13.999856591-0700
	Modify: 2017-05-11 16:03:13.999856591-0700
	Change: 2017-05-11 16:03:14.000856663-0700
	 Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000034 (........ ........ ........ ........ ........ ........ ........ .-ai.c..)
	========================================
	statx(t) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 00:1c           Inode: 63962       Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2017-05-11 16:03:14.006857097-0700
	Modify: 2017-05-11 16:03:14.006857097-0700
	Change: 2017-05-11 16:03:14.006857097-0700
	 Birth: 2017-05-11 16:03:13.999856591-0700
	Attributes: 0000000000000000 (........ ........ ........ ........ ........ ........ ........ .---.-..)
	[root@localhost ~]#
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
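
To read the Attributes value in the example output, the bits can be decoded as in the sketch below. The bit values are those defined for statx attributes in the Linux UAPI headers, copied here under `MODEL_` names so the snippet stands alone; `model_attr_string` is a hypothetical helper, and the choice of these four flags is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Attribute bits as defined for statx in the Linux UAPI headers. */
#define MODEL_STATX_ATTR_COMPRESSED	0x00000004u
#define MODEL_STATX_ATTR_IMMUTABLE	0x00000010u
#define MODEL_STATX_ATTR_APPEND		0x00000020u
#define MODEL_STATX_ATTR_NODUMP		0x00000040u

/*
 * Hypothetical helper: render the four flags as a short string in the
 * order append, immutable, nodump, compressed, using '-' for unset bits.
 * buf must have room for at least 5 bytes.
 */
static void model_attr_string(uint64_t attrs, char *buf)
{
	buf[0] = (attrs & MODEL_STATX_ATTR_APPEND) ? 'a' : '-';
	buf[1] = (attrs & MODEL_STATX_ATTR_IMMUTABLE) ? 'i' : '-';
	buf[2] = (attrs & MODEL_STATX_ATTR_NODUMP) ? 'd' : '-';
	buf[3] = (attrs & MODEL_STATX_ATTR_COMPRESSED) ? 'c' : '-';
	buf[4] = '\0';
}
```

The value 0x34 from the first run is 0x20 + 0x10 + 0x04, that is append, immutable and compressed, which matches the `chattr +aic` in the script.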

Btrfs: lzo: fix typo in error message after failed deflate · 036b0217

Timofey Titovets authored May 25, 2017

Fix copy paste typo in debug message for lzo.c, lzo is not deflate.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: btrfs_wait_tree_block_writeback can be void return · 3189ff77

Jeff Layton authored May 25, 2017

Nothing checks its return value.

Is it safe to skip checking return value of btrfs_wait_tree_block_writeback?

Liu Bo: I think yes, it's used in walk_log_tree which is called in two
places, free_log_tree and log replay.  For free_log_tree, it waits for
any running writeback of the extent buffer under freeing to finish in
case we need to access the eb pointer from page->private, and it's OK to
not check the return value, while for log replay, it doesn't wait
because wc->wait is not set. So neither cares about the writeback error.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ added more explanation to changelog, from Liu Bo ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove __BTRFS_LEAF_DATA_SIZE · 118c701e

Nikolay Borisov authored May 22, 2017

__BTRFS_LEAF_DATA_SIZE is used only by BTRFS_LEAF_DATA_SIZE. Make the
latter subsume the former.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET · 3d9ec8c4

Nikolay Borisov authored May 29, 2017

Commit 5f39d397 ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>
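
Why a function with an unused parameter can become an argument-less macro is easiest to see on a toy layout. The structures below are hypothetical and far simpler than the on-disk btrfs format; only the offsetof pattern is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified leaf layout: a header, then the items. */
struct model_header {
	uint8_t csum[32];
	uint64_t bytenr;
};

struct model_item {
	uint32_t offset;
	uint32_t size;
};

struct model_leaf {
	struct model_header header;
	struct model_item items[];	/* item array starts right after the header */
};

/* A constant offset needs no argument, hence a macro without "()". */
#define MODEL_LEAF_DATA_OFFSET offsetof(struct model_leaf, items)
```

The value depends only on the structure layout, never on a particular leaf, so passing an extent buffer to compute it adds nothing.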
