- 25 Oct, 2011 1 commit
-
-
Boaz Harrosh authored
This is finally the RAID5 Write support. The bigger part of this patch is not the XOR engine itself, But the read4write logic, which is a complete mini prepare_for_striping reading engine that can read scattered pages of a stripe into cache so it can be used for XOR calculation. That is, if the write was not stripe aligned. The main algorithm behind the XOR engine is the 2 dimensional array: struct __stripe_pages_2d. A drawing might save 1000 words --- __stripe_pages_2d | n = pages_in_stripe_unit; w = group_width - parity; | pages array presented to the XOR lib | | V | __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---| | | __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <--- | ... | ... | __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par] ^ | data added columns first then row --- The pages are put on this array columns first. .i.e: p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ... So we are doing a corner turn of the pages. Note that pages will zigzag down and left. but are put sequentially in growing order. So when the time comes to XOR the stripe, only the beginning and end of the array need be checked. We scan the array and any NULL spot will be field by pages-to-be-read. The FS that wants to support RAID5 needs to supply an operations-vector that searches a given page in cache, and specifies if the page is uptodate or need reading. All these pages to be read are put on a slave ore_io_state and synchronously read. All the pages of a stripe are read in one IO, using the scatter gather mechanism. In write we constrain our IO to only be incomplete on a single stripe. Meaning either the complete IO is within a single stripe so we might have pages to read from both beginning or end of the strip. Or we have some reading to do at beginning but end at strip boundary. The left over pages are pushed to the next IO by the API already established by previous work, where an IO offset/length combination presented to the ORE might get the length truncated and the user must re-submit the leftover pages. (Both exofs and NFS support this) But any ORE user should make it's best effort to align it's IO before hand and avoid complications. A cached ore_layout->stripe_size member can be used for that calculation. (NOTE: that ORE demands that stripe_size may not be bigger then 32bit) What else? Well read it and tell me. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
- 24 Oct, 2011 3 commits
-
-
Boaz Harrosh authored
This patch introduces the first stage of RAID5 support mainly the skip-over-raid-units when reading. For writes it inserts BLANK units, into where XOR blocks should be calculated and written to. It introduces the new "general raid maths", and the main additional parameters and components needed for raid5. Since at this stage it could corrupt future version that actually do support raid5. The enablement of raid5 mounting and setting of parity-count > 0 is disabled. So the raid5 code will never be used. Mounting of raid5 is only enabled later once the basic XOR write is also in. But if the patch "enable RAID5" is applied this code has been tested to be able to properly read raid5 volumes and is according to standard. Also it has been tested that the new maths still properly supports RAID0 and grouping code just as before. (BTW: I have found more bugs in the pnfs-obj RAID math fixed here) The ore.c file is getting too big, so new ore_raid.[hc] files are added that will include the special raid stuff that are not used in striping and mirrors. In future write support these will get bigger. When adding the ore_raid.c to Kbuild file I was forced to rename ore.ko to libore.ko. Is it possible to keep source file, say ore.c and module file ore.ko the same even if there are multiple files inside ore.ko? Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
fs/exofs directory has multiple targets now, of which the ore.ko will be needed by the pnfs-objects-layout-driver (fs/nfs/objlayout). As suggested by: Michal Marek <mmarek@suse.cz> convert inclusion of exofs/ from obj-$(CONFIG_EXOFS_FS) => obj-$(y). So ORE can be selected also from fs/nfs/Kconfig CC: Michal Marek <mmarek@suse.cz> CC: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
ore_calc_stripe_info is needed by exofs::export.c for the layout calculations. Make it exportable Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
- 14 Oct, 2011 8 commits
-
-
Boaz Harrosh authored
Current ore_check_io API receives a residual pointer, to report partial IO. But it is actually not used, because in a multiple devices IO there is never a linearity in the IO failure. On the other hand if every failing device is reported through a received callback measures can be taken to handle only failed devices. One at a time. This will also be needed by the objects-layout-driver for it's error reporting facility. Exofs is not currently using the new information and keeps the old behaviour of failing the complete IO in case of an error. (No partial completion) TODO: Use an ore_check_io callback to set_page_error only the failing pages. And re-dirty write pages. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
All users of the ore will need to check if current code supports the given layout. For example RAID5/6 is not currently supported. So move all the checks from exofs/super.c to a new ore_verify_layout() to be used by ore users. Note that any new layout should be passed through the ore_verify_layout() because the ore engine will prepare and verify some internal members of ore_layout, and assumes it's called. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
Users like the objlayout-driver would like to only pass a partial device table that covers the IO in question. For example exofs divides the file into raid-group-sized chunks and only serves group_width number of devices at a time. The partiality is communicated by setting ore_componets->first_dev and the array covers all logical devices from oc->first_dev upto (oc->first_dev + oc->numdevs) The ore_comp_dev() API receives a logical device index and returns the actual present device in the table. An out-of-range dev_index will BUG. Logical device index is the theoretical device index as if all the devices of a file are present. .i.e: total_devs = group_width * mirror_p1 * group_count 0 <= dev_index < total_devs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
Memory conditions and max_bio constraints might cause us to not comply to the full length of the requested IO. Instead of failing the complete IO we can issue a shorter read/write and report how much was actually executed in the ios->length member. All users must check ios->length at IO_done or upon return of ore_read/write and re-issue the reminder of the bytes. Because other wise there is no error returned like before. This is part of the effort to support the pnfs-obj layout driver. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
If at read/write_done the actual IO was shorter then requested, reported in returned ios->length. It is not an error. The reminder of the pages should just be unlocked but not marked uptodate or end_page_writeback. They will be re issued later by the VFS. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
Move the check and preparation of the ios->kern_buff case to later inside _write_mirror(). Since read was never used with ios->kern_buff its support is removed instead of fixed. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
Now that each ore_io_state covers only a single raid group. A single striping_info math is needed. Embed one inside ore_io_state to cache the calculation results and eliminate an extra call. Also the outer _prepare_for_striping is removed since it does nothing. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
Usually a single IO is confined to one group of devices (group_width) and at the boundary of a raid group it can spill into a second group. Current code would allocate a full device_table size array at each io_state so it can comply to requests that span two groups. Needless to say that is very wasteful, specially when device_table count can get very large (hundreds even thousands), while a group_width is usually 8 or 10. * Change ore API to trim on IO that spans two raid groups. The user passes offset+length to ore_get_rw_state, the ore might trim on that length if spanning a group boundary. The user must check ios->length or ios->nrpages to see how much IO will be preformed. It is the responsibility of the user to re-issue the reminder of the IO. * Modify exofs To copy spilled pages on to the next IO. This means one last kick is needed after all coalescing of pages is done. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
- 04 Oct, 2011 1 commit
-
-
Boaz Harrosh authored
In the pNFS obj-LD the device table at the layout level needs to point to a device_cache node, where it is possible and likely that many layouts will point to the same device-nodes. In Exofs we have a more orderly structure where we have a single array of devices that repeats twice for a round-robin view of the device table This patch moves to a model that can be used by the pNFS obj-LD where struct ore_components holds an array of ore_dev-pointers. (ore_dev is newly defined and contains a struct osd_dev *od member) Each pointer in the array of pointers will point to a bigger user-defined dev_struct. That can be accessed by use of the container_of macro. In Exofs an __alloc_dev_table() function allocates the ore_dev-pointers array as well as an exofs_dev array, in one allocation and does the addresses dance to set everything pointing correctly. It still keeps the double allocation trick for the inodes round-robin view of the table. The device table is always allocated dynamically, also for the single device case. So it is unconditionally freed at umount. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
- 03 Oct, 2011 5 commits
-
-
Boaz Harrosh authored
The struct ore_striping_info will be used later in other structures. And ore_calc_stripe_info as well. Rename them make struct ore_striping_info public. ore_calc_stripe_info is still static, will be made public on first use. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
Boaz Harrosh authored
The struct pnfs_osd_data_map data_map member of exofs_sb_info was never used after mount. In fact all it's members were duplicated by the ore_layout structure. So just remove the duplicated information. Also removed some stupid, but perfectly supported, restrictions on layout parameters. The case where num_devices is not divisible by mirror_count+1 is perfectly fine since the rotating device view will eventually use all the devices it can get. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>
-
Boaz Harrosh authored
ore_components already has a comps member so this leads to things like comps->comps which is annoying. the name oc was already used in new code. So rename all old usage of ore_components comps => ore_components oc. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
H Hartley Sweeten authored
This quiets the following sparse noise: warning: symbol 'exofs_sync_fs' was not declared. Should it be static? warning: symbol 'exofs_free_sbi' was not declared. Should it be static? warning: symbol 'exofs_get_parent' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
H Hartley Sweeten authored
This quiets the sparse noise: warning: symbol '_calc_trunk_info' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
- 22 Sep, 2011 1 commit
-
-
Boaz Harrosh authored
The OSD protocol calls for all kind of security levels that use CRYPTO_HMAC and SH1, but the current code only supports NO_SEC, which does not use any of these. Remove a wrong FIXME that calls for them. Thanks Maxin for reporting on this. Reported-by: "Maxin B. John" <maxin.john@gmail.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
-
- 12 Sep, 2011 9 commits
-
-
Linus Torvalds authored
-
git://people.freedesktop.org/~airlied/linuxLinus Torvalds authored
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: drm: Remove duplicate "return" statement drm/nv04/crtc: Bail out if FB is not bound to crtc drm/nouveau: fix nv04_sgdma_bind on non-"4kB pages" archs drm/nouveau: properly handle allocation failure in nouveau_sgdma_populate drm/nouveau: fix oops on pre-semaphore hardware drm/nv50/crtc: Bail out if FB is not bound to crtc drm/radeon/kms: fix DP detect and EDID fetch for DP bridges
-
git://git.linaro.org/people/arnd/arm-socLinus Torvalds authored
* 'fixes' of git://git.linaro.org/people/arnd/arm-soc: ARM: CSR: add missing sentinels to of_device_id tables ARM: cns3xxx: Fix newly introduced warnings in the PCIe code ARM: cns3xxx: Fix compile error caused by hardware.h removed ARM: davinci: fix cache flush build error ARM: davinci: correct MDSTAT_STATE_MASK ARM: davinci: da850 EVM: read mac address from SPI flash OMAP: omap_device: fix !CONFIG_SUSPEND case in _noirq handlers OMAP2430: hwmod: musb: add missing terminator to omap2430_usbhsotg_addrs[] OMAP3: clock: indicate that gpt12_fck and wdt1_fck are in the WKUP clockdomain OMAP4: clock: fix compile warning OMAP4: clock: re-enable previous clockdomain enable/disable sequence OMAP: clockdomain: Wait for powerdomain to be ON when using clockdomain force wakeup OMAP: powerdomains: Make all powerdomain target states as ON at init
-
Mathieu Desnoyers authored
The LTTng 2.0 kernel tracer (stand-alone module package, available at http://lttng.org) uses the 0xF6 ioctl range for tracer control and transport operations. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://github.com/chrismason/linuxLinus Torvalds authored
* 'for-linus' of git://github.com/chrismason/linux: Btrfs: add dummy extent if dst offset excceeds file end in Btrfs: calc file extent num_bytes correctly in file clone btrfs: xattr: fix attribute removal Btrfs: fix wrong nbytes information of the inode Btrfs: fix the file extent gap when doing direct IO Btrfs: fix unclosed transaction handle in btrfs_cont_expand Btrfs: fix misuse of trans block rsv Btrfs: reset to appropriate block rsv after orphan operations Btrfs: skip locking if searching the commit root in csum lookup btrfs: fix warning in iput for bad-inode Btrfs: fix an oops when deleting snapshots
-
Miklos Szeredi authored
kmemleak is reporting that 32 bytes are being leaked by FUSE: unreferenced object 0xe373b270 (size 32): comm "fusermount", pid 1207, jiffies 4294707026 (age 2675.187s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<b05517d7>] kmemleak_alloc+0x27/0x50 [<b0196435>] kmem_cache_alloc+0xc5/0x180 [<b02455be>] fuse_alloc_forget+0x1e/0x20 [<b0245670>] fuse_alloc_inode+0xb0/0xd0 [<b01b1a8c>] alloc_inode+0x1c/0x80 [<b01b290f>] iget5_locked+0x8f/0x1a0 [<b0246022>] fuse_iget+0x72/0x1a0 [<b02461da>] fuse_get_root_inode+0x8a/0x90 [<b02465cf>] fuse_fill_super+0x3ef/0x590 [<b019e56f>] mount_nodev+0x3f/0x90 [<b0244e95>] fuse_mount+0x15/0x20 [<b019d1bc>] mount_fs+0x1c/0xc0 [<b01b5811>] vfs_kern_mount+0x41/0x90 [<b01b5af9>] do_kern_mount+0x39/0xd0 [<b01b7585>] do_mount+0x2e5/0x660 [<b01b7966>] sys_mount+0x66/0xa0 This leak report is consistent and happens once per boot on 3.1.0-rc5-dirty. This happens if a FORGET request is queued after the fuse device was released. Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miklos Szeredi authored
Commit 37fb3a30 ("fuse: fix flock") added in 3.1-rc4 caused flock() to fail with ENOSYS with the kernel ABI version 7.16 or earlier. Fix by falling back to testing FUSE_POSIX_LOCKS for ABI versions 7.16 and earlier. Reported-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
-
Arnd Bergmann authored
-
- 11 Sep, 2011 12 commits
-
-
git://linuxtv.org/mchehab/for_linusLinus Torvalds authored
* 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus: [media] vp7045: fix buffer setup [media] nuvoton-cir: simplify raw IR sample handling [media] [Resend] viacam: Don't explode if pci_find_bus() returns NULL [media] v4l2: Fix documentation of the codec device controls [media] gspca - sonixj: Fix the darkness of sensor om6802 in 320x240 [media] gspca - sonixj: Fix wrong register mask for sensor om6802 [media] gspca - ov519: Fix LED inversion of some ov519 webcams [media] pwc: precedence bug in pwc_init_controls()
-
git://openrisc.net/~jonas/linuxLinus Torvalds authored
* 'for-linus' of git://openrisc.net/~jonas/linux: Add missing DMA ops openrisc: don't use pt_regs in struct sigcontext
-
Li Zefan authored
You can see there's no file extent with range [0, 4096]. Check this by btrfsck: # btrfsck /dev/sda7 root 5 inode 258 errors 100 ... Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Li Zefan authored
num_bytes should be 4096 not 12288. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
David Sterba authored
An attribute is not removed by 'setfattr -x attr file' and remains visible in attr list. This makes xfstests/062 pass again. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Miao Xie authored
If we write some data into the data hole of the file(no preallocation for this hole), Btrfs will allocate some disk space, and update nbytes of the inode, but the other element--disk_i_size needn't be updated. At this condition, we must update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size() return 1). # mkfs.btrfs /dev/sdb1 # mount /dev/sdb1 /mnt # touch /mnt/a # truncate -s 856002 /mnt/a # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc # umount /mnt # btrfsck /dev/sdb1 root 5 inode 257 errors 400 found 32768 bytes used err is 1 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Miao Xie authored
When we write some data to the place that is beyond the end of the file in direct I/O mode, a data hole will be created. And Btrfs should insert a file extent item that point to this hole into the fs tree. But unfortunately Btrfs forgets doing it. The following is a simple way to reproduce it: # mkfs.btrfs /dev/sdc2 # mount /dev/sdc2 /test4 # touch /test4/a # dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc # umount /test4 # btrfsck /dev/sdc2 root 5 inode 257 errors 100 Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Miao Xie authored
The function - btrfs_cont_expand() forgot to close the transaction handle before it jump out the while loop. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Liu Bo authored
At the beginning of create_pending_snapshot, trans->block_rsv is set to pending->block_rsv and is used for snapshot things, however, when it is done, we do not recover it as will. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Liu Bo authored
While truncating free space cache, we forget to change trans->block_rsv back to the original one, but leave it with the orphan_block_rsv, and then with option inode_cache enable, it leads to countless warnings of btrfs_alloc_free_block and btrfs_orphan_commit_root: WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]() ... WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]() Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
It's not enough to just search the commit root, since we could be cow'ing the very block we need to search through, which would mean that its locked and we'll still deadlock. So use path->skip_locking as well. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Sergei Trofimovich authored
iput() shouldn't be called for inodes in I_NEW state. We need to mark inode as constructed first. WARNING: at fs/inode.c:1309 iput+0x20b/0x210() Call Trace: [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0 [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20 [<ffffffff810eaf0b>] iput+0x20b/0x210 [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0 [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210 [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0 [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0 [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280 [<ffffffff8105af86>] kthread+0x96/0xa0 [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10 [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190 [<ffffffff814336d0>] ? gs_change+0xb/0xb Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> CC: Konstantin Khlebnikov <khlebnikov@openvz.org> Tested-by: David Sterba <dsterba@suse.cz> CC: Josef Bacik <josef@redhat.com> CC: Chris Mason <chris.mason@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-