Commits · 769ba8d92025fa390f3097e658b8ed6e032d68e9 · nexedi / linux

25 Oct, 2011 1 commit

Boaz Harrosh authored Oct 14, 2011

This is finally the RAID5 Write support.

The bigger part of this patch is not the XOR engine itself, But the
read4write logic, which is a complete mini prepare_for_striping
reading engine that can read scattered pages of a stripe into cache
so it can be used for XOR calculation. That is, if the write was not
stripe aligned.

The main algorithm behind the XOR engine is the 2 dimensional array:
	struct __stripe_pages_2d.
A drawing might save 1000 words
---

__stripe_pages_2d
       |
 n = pages_in_stripe_unit;
 w = group_width - parity;
       |                            pages array presented to the XOR lib
       |                                                |
       V                                                |
 __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
       |                                                |
 __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
       |
...    |                         ...
       |
 __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                               ^
                               |
           data added columns first then row

---
The pages are put on this array columns first. .i.e:
	p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
So we are doing a corner turn of the pages.

Note that pages will zigzag down and left. but are put sequentially
in growing order. So when the time comes to XOR the stripe, only the
beginning and end of the array need be checked. We scan the array
and any NULL spot will be field by pages-to-be-read.

The FS that wants to support RAID5 needs to supply an
operations-vector that searches a given page in cache, and specifies
if the page is uptodate or need reading. All these pages to be read
are put on a slave ore_io_state and synchronously read. All the pages
of a stripe are read in one IO, using the scatter gather mechanism.

In write we constrain our IO to only be incomplete on a single
stripe. Meaning either the complete IO is within a single stripe so
we might have pages to read from both beginning  or end of the
strip. Or we have some reading to do at beginning but end at strip
boundary. The left over pages are pushed to the next IO by the API
already established by previous work, where an IO offset/length
combination presented to the ORE might get the length truncated and
the user must re-submit the leftover pages. (Both exofs and NFS
support this)

But any ORE user should make it's best effort to align it's IO
before hand and avoid complications. A cached ore_layout->stripe_size
member can be used for that calculation. (NOTE: that ORE demands
that stripe_size may not be bigger then 32bit)

What else? Well read it and tell me.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

769ba8d9

24 Oct, 2011 3 commits

ore: RAID5 read · a1fec1db

Boaz Harrosh authored Oct 12, 2011

This patch introduces the first stage of RAID5 support
mainly the skip-over-raid-units when reading. For
writes it inserts BLANK units, into where XOR blocks
should be calculated and written to.

It introduces the new "general raid maths", and the main
additional parameters and components needed for raid5.

Since at this stage it could corrupt future version that
actually do support raid5. The enablement of raid5
mounting and setting of parity-count > 0 is disabled. So
the raid5 code will never be used. Mounting of raid5 is
only enabled later once the basic XOR write is also in.
But if the patch "enable RAID5" is applied this code has
been tested to be able to properly read raid5 volumes
and is according to standard.

Also it has been tested that the new maths still properly
supports RAID0 and grouping code just as before.
(BTW: I have found more bugs in the pnfs-obj RAID math
 fixed here)

The ore.c file is getting too big, so new ore_raid.[hc]
files are added that will include the special raid stuff
that are not used in striping and mirrors. In future write
support these will get bigger.
When adding the ore_raid.c to Kbuild file I was forced to
rename ore.ko to libore.ko. Is it possible to keep source
file, say ore.c and module file ore.ko the same even if there
are multiple files inside ore.ko?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

a1fec1db

fs/Makefile: Always inspect exofs/ · 3e335672

Boaz Harrosh authored Oct 24, 2011

fs/exofs directory has multiple targets now, of which the
ore.ko will be needed by the pnfs-objects-layout-driver
(fs/nfs/objlayout).

As suggested by: Michal Marek <mmarek@suse.cz>  convert
inclusion of exofs/ from obj-$(CONFIG_EXOFS_FS) => obj-$(y).
So ORE can be selected also from fs/nfs/Kconfig

CC: Michal Marek <mmarek@suse.cz>
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

3e335672

ore: Make ore_calc_stripe_info EXPORT_SYMBOL · 611d7a5d

Boaz Harrosh authored Oct 04, 2011

ore_calc_stripe_info is needed by exofs::export.c
for the layout calculations. Make it exportable
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

611d7a5d

14 Oct, 2011 8 commits

ore/exofs: Change ore_check_io API · 4b46c9f5

Boaz Harrosh authored Sep 28, 2011

Current ore_check_io API receives a residual
pointer, to report partial IO. But it is actually
not used, because in a multiple devices IO there
is never a linearity in the IO failure.

On the other hand if every failing device is reported
through a received callback measures can be taken to
handle only failed devices. One at a time.

This will also be needed by the objects-layout-driver
for it's error reporting facility.

Exofs is not currently using the new information and
keeps the old behaviour of failing the complete IO in
case of an error. (No partial completion)

TODO: Use an ore_check_io callback to set_page_error only
the failing pages. And re-dirty write pages.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

4b46c9f5

ore/exofs: Define new ore_verify_layout · 5a51c0c7

Boaz Harrosh authored Sep 28, 2011

All users of the ore will need to check if current code
supports the given layout. For example RAID5/6 is not
currently supported.

So move all the checks from exofs/super.c to a new
ore_verify_layout() to be used by ore users.

Note that any new layout should be passed through the
ore_verify_layout() because the ore engine will prepare
and verify some internal members of ore_layout, and
assumes it's called.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

5a51c0c7

ore: Support for partial component table · 3bd98568

Boaz Harrosh authored Sep 28, 2011

Users like the objlayout-driver would like to only pass
a partial device table that covers the IO in question.
For example exofs divides the file into raid-group-sized
chunks and only serves group_width number of devices at
a time.

The partiality is communicated by setting
ore_componets->first_dev and the array covers all logical
devices from oc->first_dev upto (oc->first_dev + oc->numdevs)

The ore_comp_dev() API receives a logical device index
and returns the actual present device in the table.
An out-of-range dev_index will BUG.

Logical device index is the theoretical device index as if
all the devices of a file are present. .i.e:
	total_devs = group_width * mirror_p1 * group_count
	0 <= dev_index < total_devs
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

3bd98568

ore: Support for short read/writes · bbf9a31b

Boaz Harrosh authored Aug 26, 2011

Memory conditions and max_bio constraints might cause us to
not comply to the full length of the requested IO. Instead of
failing the complete IO we can issue a shorter read/write and
report how much was actually executed in the ios->length
member.

All users must check ios->length at IO_done or upon return of
ore_read/write and re-issue the reminder of the bytes. Because
other wise there is no error returned like before.

This is part of the effort to support the pnfs-obj layout driver.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

bbf9a31b

exofs: Support for short read/writes · 154a9300

Boaz Harrosh authored Aug 26, 2011

If at read/write_done the actual IO was shorter then requested,
reported in returned ios->length. It is not an error. The reminder
of the pages should just be unlocked but not marked uptodate or
end_page_writeback. They will be re issued later by the VFS.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

154a9300

ore: Remove check for ios->kern_buff in _prepare_for_striping to later · 6851a5e5

Boaz Harrosh authored Aug 24, 2011

Move the check and preparation of the ios->kern_buff case to
later inside _write_mirror().

Since read was never used with ios->kern_buff its support is removed
instead of fixed.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

6851a5e5

ore: cleanup: Embed an ore_striping_info inside ore_io_state · 98260754

Boaz Harrosh authored Oct 02, 2011

Now that each ore_io_state covers only a single raid group.
A single striping_info math is needed. Embed one inside
ore_io_state to cache the calculation results and eliminate
an extra call.

Also the outer _prepare_for_striping is removed since it does nothing.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

98260754

ore: Only IO one group at a time (API change) · b916c5cd

Boaz Harrosh authored Sep 28, 2011

Usually a single IO is confined to one group of devices
(group_width) and at the boundary of a raid group it can
spill into a second group. Current code would allocate a
full device_table size array at each io_state so it can
comply to requests that span two groups. Needless to say
that is very wasteful, specially when device_table count
can get very large (hundreds even thousands), while a
group_width is usually 8 or 10.

* Change ore API to trim on IO that spans two raid groups.
  The user passes offset+length to ore_get_rw_state, the
  ore might trim on that length if spanning a group boundary.
  The user must check ios->length or ios->nrpages to see
  how much IO will be preformed. It is the responsibility
  of the user to re-issue the reminder of the IO.

* Modify exofs To copy spilled pages on to the next IO.
  This means one last kick is needed after all coalescing
  of pages is done.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

b916c5cd

04 Oct, 2011 1 commit

ore/exofs: Change the type of the devices array (API change) · d866d875

Boaz Harrosh authored Sep 28, 2011

In the pNFS obj-LD the device table at the layout level needs
to point to a device_cache node, where it is possible and likely
that many layouts will point to the same device-nodes.

In Exofs we have a more orderly structure where we have a single
array of devices that repeats twice for a round-robin view of the
device table

This patch moves to a model that can be used by the pNFS obj-LD
where struct ore_components holds an array of ore_dev-pointers.
(ore_dev is newly defined and contains a struct osd_dev *od
 member)

Each pointer in the array of pointers will point to a bigger
user-defined dev_struct. That can be accessed by use of the
container_of macro.

In Exofs an __alloc_dev_table() function allocates the
ore_dev-pointers array as well as an exofs_dev array, in one
allocation and does the addresses dance to set everything pointing
correctly. It still keeps the double allocation trick for the
inodes round-robin view of the table.

The device table is always allocated dynamically, also for the
single device case. So it is unconditionally freed at umount.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

d866d875

03 Oct, 2011 5 commits

ore: Make ore_striping_info and ore_calc_stripe_info public · eb507bc1

Boaz Harrosh authored Aug 10, 2011

The struct ore_striping_info will be used later in other
structures. And ore_calc_stripe_info as well. Rename them
make struct ore_striping_info public. ore_calc_stripe_info
is still static, will be made public on first use.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

eb507bc1

exofs: Remove unused data_map member from exofs_sb_info · 8d2d83a8

Boaz Harrosh authored Aug 10, 2011

The struct pnfs_osd_data_map data_map member of exofs_sb_info was
never used after mount. In fact all it's members were duplicated
by the ore_layout structure. So just remove the duplicated information.

Also removed some stupid, but perfectly supported, restrictions on
layout parameters. The case where num_devices is not divisible by
mirror_count+1 is perfectly fine since the rotating device view
will eventually use all the devices it can get.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>

8d2d83a8

exofs: Rename struct ore_components comps => oc · 5bf696da

Boaz Harrosh authored Sep 28, 2011

ore_components already has a comps member so this leads
to things like comps->comps which is annoying. the name oc
was already used in new code. So rename all old usage of
ore_components comps => ore_components oc.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

5bf696da

exofs/super.c: local functions should be static · de74b05a

H Hartley Sweeten authored Sep 23, 2011

This quiets the following sparse noise:

warning: symbol 'exofs_sync_fs' was not declared. Should it be static?
warning: symbol 'exofs_free_sbi' was not declared. Should it be static?
warning: symbol 'exofs_get_parent' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

de74b05a

exofs/ore.c: local functions should be static · 1958c7c2

H Hartley Sweeten authored Sep 23, 2011

This quiets the sparse noise:

warning: symbol '_calc_trunk_info' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

1958c7c2

22 Sep, 2011 1 commit

osd: Kconfig remove wrong FIXME · a8f8c450

Boaz Harrosh authored Aug 15, 2011

The OSD protocol calls for all kind of security levels that use
CRYPTO_HMAC and SH1, but the current code only supports NO_SEC,
which does not use any of these.

Remove a wrong FIXME that calls for them. Thanks Maxin for
reporting on this.
Reported-by: "Maxin B. John" <maxin.john@gmail.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

a8f8c450

12 Sep, 2011 9 commits

Linux 3.1-rc6 · b6fd41e2
Linus Torvalds authored Sep 12, 2011

b6fd41e2

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · 8cb3ed17

Linus Torvalds authored Sep 12, 2011

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm: Remove duplicate "return" statement
  drm/nv04/crtc: Bail out if FB is not bound to crtc
  drm/nouveau: fix nv04_sgdma_bind on non-"4kB pages" archs
  drm/nouveau: properly handle allocation failure in nouveau_sgdma_populate
  drm/nouveau: fix oops on pre-semaphore hardware
  drm/nv50/crtc: Bail out if FB is not bound to crtc
  drm/radeon/kms: fix DP detect and EDID fetch for DP bridges

8cb3ed17

Merge branch 'fixes' of git://git.linaro.org/people/arnd/arm-soc · 4c752782

Linus Torvalds authored Sep 12, 2011

* 'fixes' of git://git.linaro.org/people/arnd/arm-soc:
  ARM: CSR: add missing sentinels to of_device_id tables
  ARM: cns3xxx: Fix newly introduced warnings in the PCIe code
  ARM: cns3xxx: Fix compile error caused by hardware.h removed
  ARM: davinci: fix cache flush build error
  ARM: davinci: correct MDSTAT_STATE_MASK
  ARM: davinci: da850 EVM: read mac address from SPI flash
  OMAP: omap_device: fix !CONFIG_SUSPEND case in _noirq handlers
  OMAP2430: hwmod: musb: add missing terminator to omap2430_usbhsotg_addrs[]
  OMAP3: clock: indicate that gpt12_fck and wdt1_fck are in the WKUP clockdomain
  OMAP4: clock: fix compile warning
  OMAP4: clock: re-enable previous clockdomain enable/disable sequence
  OMAP: clockdomain: Wait for powerdomain to be ON when using clockdomain force wakeup
  OMAP: powerdomains: Make all powerdomain target states as ON at init

4c752782

ioctl: register LTTng ioctl · 14d01ff5

Mathieu Desnoyers authored Sep 11, 2011

The LTTng 2.0 kernel tracer (stand-alone module package, available at
http://lttng.org) uses the 0xF6 ioctl range for tracer control and
transport operations.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

14d01ff5

Merge branch 'for-linus' of git://github.com/chrismason/linux · 0b001b2e

Linus Torvalds authored Sep 12, 2011

* 'for-linus' of git://github.com/chrismason/linux:
  Btrfs: add dummy extent if dst offset excceeds file end in
  Btrfs: calc file extent num_bytes correctly in file clone
  btrfs: xattr: fix attribute removal
  Btrfs: fix wrong nbytes information of the inode
  Btrfs: fix the file extent gap when doing direct IO
  Btrfs: fix unclosed transaction handle in btrfs_cont_expand
  Btrfs: fix misuse of trans block rsv
  Btrfs: reset to appropriate block rsv after orphan operations
  Btrfs: skip locking if searching the commit root in csum lookup
  btrfs: fix warning in iput for bad-inode
  Btrfs: fix an oops when deleting snapshots

0b001b2e

fuse: fix memory leak · 5dfcc87f

Miklos Szeredi authored Sep 12, 2011

kmemleak is reporting that 32 bytes are being leaked by FUSE:

  unreferenced object 0xe373b270 (size 32):
  comm "fusermount", pid 1207, jiffies 4294707026 (age 2675.187s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<b05517d7>] kmemleak_alloc+0x27/0x50
    [<b0196435>] kmem_cache_alloc+0xc5/0x180
    [<b02455be>] fuse_alloc_forget+0x1e/0x20
    [<b0245670>] fuse_alloc_inode+0xb0/0xd0
    [<b01b1a8c>] alloc_inode+0x1c/0x80
    [<b01b290f>] iget5_locked+0x8f/0x1a0
    [<b0246022>] fuse_iget+0x72/0x1a0
    [<b02461da>] fuse_get_root_inode+0x8a/0x90
    [<b02465cf>] fuse_fill_super+0x3ef/0x590
    [<b019e56f>] mount_nodev+0x3f/0x90
    [<b0244e95>] fuse_mount+0x15/0x20
    [<b019d1bc>] mount_fs+0x1c/0xc0
    [<b01b5811>] vfs_kern_mount+0x41/0x90
    [<b01b5af9>] do_kern_mount+0x39/0xd0
    [<b01b7585>] do_mount+0x2e5/0x660
    [<b01b7966>] sys_mount+0x66/0xa0

This leak report is consistent and happens once per boot on
3.1.0-rc5-dirty.

This happens if a FORGET request is queued after the fuse device was
released.
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5dfcc87f

fuse: fix flock breakage · 24114504

Miklos Szeredi authored Sep 12, 2011

Commit 37fb3a30 ("fuse: fix flock") added in 3.1-rc4 caused flock() to
fail with ENOSYS with the kernel ABI version 7.16 or earlier.

Fix by falling back to testing FUSE_POSIX_LOCKS for ABI versions 7.16
and earlier.
Reported-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24114504

Merge branch 'for_3.1/pm-fixes-2' of git://gitorious.org/khilman/linux-omap-pm into fixes · 15ce9286
Arnd Bergmann authored Sep 12, 2011

15ce9286
Merge branch 'sirf/fixes' into fixes · d035953e
Arnd Bergmann authored Sep 12, 2011

d035953e

11 Sep, 2011 12 commits

Merge branch 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus · 87adf1c6

Linus Torvalds authored Sep 11, 2011

* 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus:
  [media] vp7045: fix buffer setup
  [media] nuvoton-cir: simplify raw IR sample handling
  [media] [Resend] viacam: Don't explode if pci_find_bus() returns NULL
  [media] v4l2: Fix documentation of the codec device controls
  [media] gspca - sonixj: Fix the darkness of sensor om6802 in 320x240
  [media] gspca - sonixj: Fix wrong register mask for sensor om6802
  [media] gspca - ov519: Fix LED inversion of some ov519 webcams
  [media] pwc: precedence bug in pwc_init_controls()

87adf1c6

Merge branch 'for-linus' of git://openrisc.net/~jonas/linux · 14f69ec7

Linus Torvalds authored Sep 11, 2011

* 'for-linus' of git://openrisc.net/~jonas/linux:
  Add missing DMA ops
  openrisc: don't use pt_regs in struct sigcontext

14f69ec7

Btrfs: add dummy extent if dst offset excceeds file end in · d525e8ab

Li Zefan authored Sep 11, 2011

You can see there's no file extent with range [0, 4096]. Check this by
btrfsck:

 # btrfsck /dev/sda7
 root 5 inode 258 errors 100
 ...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d525e8ab

Btrfs: calc file extent num_bytes correctly in file clone · d72c0842

Li Zefan authored Sep 11, 2011

num_bytes should be 4096 not 12288.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d72c0842

btrfs: xattr: fix attribute removal · 4815053a

David Sterba authored Sep 11, 2011

An attribute is not removed by 'setfattr -x attr file' and remains
visible in attr list. This makes xfstests/062 pass again.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4815053a

Btrfs: fix wrong nbytes information of the inode · a39f7521

Miao Xie authored Sep 11, 2011

If we write some data into the data hole of the file(no preallocation for this
hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
the other element--disk_i_size needn't be updated. At this condition, we must
update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
return 1).

 # mkfs.btrfs /dev/sdb1
 # mount /dev/sdb1 /mnt
 # touch /mnt/a
 # truncate -s 856002 /mnt/a
 # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
 # umount /mnt
 # btrfsck /dev/sdb1
 root 5 inode 257 errors 400
 found 32768 bytes used err is 1
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

a39f7521

Btrfs: fix the file extent gap when doing direct IO · 0c1a98c8

Miao Xie authored Sep 11, 2011

When we write some data to the place that is beyond the end of the file
in direct I/O mode, a data hole will be created. And Btrfs should insert
a file extent item that point to this hole into the fs tree. But unfortunately
Btrfs forgets doing it.

The following is a simple way to reproduce it:
 # mkfs.btrfs /dev/sdc2
 # mount /dev/sdc2 /test4
 # touch /test4/a
 # dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc
 # umount /test4
 # btrfsck /dev/sdc2
 root 5 inode 257 errors 100
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

0c1a98c8

Btrfs: fix unclosed transaction handle in btrfs_cont_expand · 5b397377

Miao Xie authored Sep 11, 2011

The function - btrfs_cont_expand() forgot to close the transaction handle before
it jump out the while loop. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5b397377

Btrfs: fix misuse of trans block rsv · 98c9942a

Liu Bo authored Sep 11, 2011

At the beginning of create_pending_snapshot, trans->block_rsv is set
to pending->block_rsv and is used for snapshot things, however, when
it is done, we do not recover it as will.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

98c9942a

Btrfs: reset to appropriate block rsv after orphan operations · 65450aa6

Liu Bo authored Sep 11, 2011

While truncating free space cache, we forget to change trans->block_rsv
back to the original one, but leave it with the orphan_block_rsv, and
then with option inode_cache enable, it leads to countless warnings of
btrfs_alloc_free_block and btrfs_orphan_commit_root:

WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]()
...
WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

65450aa6

Btrfs: skip locking if searching the commit root in csum lookup · ddf23b3f

Josef Bacik authored Sep 11, 2011

It's not enough to just search the commit root, since we could be cow'ing the
very block we need to search through, which would mean that its locked and we'll
still deadlock. So use path->skip_locking as well. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ddf23b3f

btrfs: fix warning in iput for bad-inode · e0b6d65b

Sergei Trofimovich authored Sep 11, 2011

iput() shouldn't be called for inodes in I_NEW state.
We need to mark inode as constructed first.

WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
Call Trace:
 [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
 [<ffffffff810eaf0b>] iput+0x20b/0x210
 [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
 [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
 [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
 [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
 [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
 [<ffffffff8105af86>] kthread+0x96/0xa0
 [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
 [<ffffffff814336d0>] ? gs_change+0xb/0xb
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: David Sterba <dsterba@suse.cz>
CC: Josef Bacik <josef@redhat.com>
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e0b6d65b