- 12 Feb, 2019 30 commits
-
-
Brian Foster authored
Most verifiers that check on-disk magic values convert the CPU endian magic value constant to disk endian to facilitate compile time optimization of the byte swap and reduce the need for runtime byte swaps in buffer verifiers. Several buffer verifiers do not follow this pattern. Update those verifiers for consistency. Also fix up a random typo in the inode readahead verifier name. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
Improve the documentation around xfs_buf_ensure_ops, which is the function that is responsible for cleaning up the b_ops state of buffers that go through xrep_findroot_block but don't match anything. Rename the function to xfs_buf_reverify. [darrick: this started off as bfoster mods of a previous patch of mine, but the renaming part is now this separate patch.] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Use a rhashtable to cache the unlinked list incore. This should speed up unlinked processing considerably when there are a lot of inodes on the unlinked list because iunlink_remove no longer has to traverse an entire bucket list to find which inode points to the one being removed. The incore list structure records "X.next_unlinked = Y" relations, with the rhashtable using Y to index the records. This makes finding the inode X that points to a inode Y very quick. If our cache fails to find anything we can always fall back on the old method. FWIW this drastically reduces the amount of time it takes to remove inodes from the unlinked list. I wrote a program to open a lot of O_TMPFILE files and then close them in the same order, which takes a very long time if we have to traverse the unlinked lists. With the ptach, I see: + /d/t/tmpfile/tmpfile Opened 193531 files in 6.33s. Closed 193531 files in 5.86s real 0m12.192s user 0m0.064s sys 0m11.619s + cd / + umount /mnt real 0m0.050s user 0m0.004s sys 0m0.030s And without the patch: + /d/t/tmpfile/tmpfile Opened 193588 files in 6.35s. Closed 193588 files in 751.61s real 12m38.853s user 0m0.084s sys 12m34.470s + cd / + umount /mnt real 0m0.086s user 0m0.000s sys 0m0.060s Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Add tracepoints so we can associate high level operations with low level updates. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
In xfs_iunlink_remove we have two identical calls to xfs_iunlink_update_inode, so move it out of the if statement to simplify the code some more. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
There's a loop that searches an unlinked bucket list to find the inode that points to a given inode. Hoist this into a separate function; later we'll use our iunlink backref cache to bypass the slow list operation. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Hoist the functions that update an inode's unlinked pointer updates into a helper. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Strengthen our checking of the AGI unlinked pointers when we start to use them for updating the metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Split the AGI unlinked bucket updates into a separate function. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Add a new helper to check that a per-AG inode pointer is either null or points somewhere valid within that AG. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Fix some indentation issues with the iunlink functions and reorganize the tops of the functions to be identical. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Brian Foster authored
The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping outside of the current page. This is because the ilock is cycled between the time the caller originally looked up the mapping and across each real allocation of the provided file range. This code has collected various hacks over the years to help combat the symptoms of these races (i.e., truncate race detection, allocation into hole detection, etc.), but none address the fundamental problem that the imap may not be valid at allocation time. Rather than continue to use race detection hacks, update writeback delalloc conversion to a model that explicitly converts the delalloc extent backing the current file offset being processed. The current file offset is the only block we can trust to remain once the ilock is dropped because any operation that can remove the block (truncate, hole punch, etc.) must flush and discard pagecache pages first. Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc() mechanism to request allocation of the entire delalloc extent backing the current offset instead of assuming the extent passed by the caller is unchanged. Record the range specified by the caller and apply it to the resulting allocated extent so previous checks by the caller for COW fork overlap are not lost. Finally, overload the bmapi delalloc flag with the range reval flag behavior since this is the only use case for both. This ensures that writeback always picks up the correct and current extent associated with the page, regardless of races with other extent modifying operations. If operating on a data fork and the COW overlap state has changed since the ilock was cycled, the caller revalidates against the COW fork sequence number before using the imap for the next block. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping. This stems from the fact that the bmapi allocation code requires a file range to allocate and the writeback conversion code assumes the range of the currently cached mapping is still valid with respect to the fork. It may not be valid, however, because the ilock is cycled (potentially multiple times) between the time the cached mapping was populated and the delalloc conversion occurs. To facilitate a solution to this problem, create a new xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file (FSB) offset and attempts to allocate whatever delalloc extent backs the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set the range based on the extent backing the bno parameter unless bno lands in a hole. If bno does land in a hole, fall back to the current behavior (which may result in an error or quietly skipping holes in the specified range depending on other parameters). This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
Now that the cached writeback mapping is explicitly invalidated on data fork changes, the EOF trimming band-aid is no longer necessary. Remove xfs_trim_extent_eof() as well since it has no other users. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
The writeback code caches the current extent mapping across multiple xfs_do_writepage() calls to avoid repeated lookups for sequential pages backed by the same extent. This is known to be slightly racy with extent fork changes in certain difficult to reproduce scenarios. The cached extent is trimmed to within EOF to help avoid the most common vector for this problem via speculative preallocation management, but this is a band-aid that does not address the fundamental problem. Now that we have an xfs_ifork sequence counter mechanism used to facilitate COW writeback, we can use the same mechanism to validate consistency between the data fork and cached writeback mappings. On its face, this is somewhat of a big hammer approach because any change to the data fork invalidates any mapping currently cached by a writeback in progress regardless of whether the data fork change overlaps with the range under writeback. In practice, however, the impact of this approach is minimal in most cases. First, data fork changes (delayed allocations) caused by sustained sequential buffered writes are amortized across speculative preallocations. This means that a cached mapping won't be invalidated by each buffered write of a common file copy workload, but rather only on less frequent allocation events. Second, the extent tree is always entirely in-core so an additional lookup of a usable extent mostly costs a shared ilock cycle and in-memory tree lookup. This means that a cached mapping reval is relatively cheap compared to the I/O itself. Third, spurious invalidations don't impact ioend construction. This means that even if the same extent is revalidated multiple times across multiple writepage instances, we still construct and submit the same size ioend (and bio) if the blocks are physically contiguous. Update struct xfs_writepage_ctx with a new field to hold the sequence number of the data fork associated with the currently cached mapping. Check the wpc seqno against the data fork when the mapping is validated and reestablish the mapping whenever the fork has changed since the mapping was cached. This ensures that writeback always uses a valid extent mapping and thus prevents lost writebacks and stale delalloc block problems. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
The sequence counter in the xfs_ifork structure is only updated on COW forks. This is because the counter is currently only used to optimize out repetitive COW fork checks at writeback time. Tweak the extent code to update the seq counter regardless of the fork type in preparation for using this counter on data forks as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Marco Benatto authored
Currently we have a few PTAGs in place allowing us to transform a filesystem error in a BUG() call. However, we don't have a panic tag for corrupt metadata, so introduce XFS_PTAG_VERIFIER_ERROR so that the administrator can use the fs.xfs.panic_mask sysctl knob to convert any error detected by buffer verifiers into a kernel panic. Signed-off-by: Marco Benatto <mbenatto@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> [darrick: light editing of commit message] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
YueHaibing authored
Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Darrick J. Wong authored
Check extended attribute entry names for invalid characters. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Check directory entry names for invalid characters. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Fix an off-by-one error in the realtime bitmap "is used" cross-reference helper function if the realtime extent size is a single block. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Teach scrub to flag extent maps that exceed the range that can be mapped with a xfs_dablk_t. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
The extended attribute scrubber should abort the "read all attrs" loop if there's a fatal signal pending on the process. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Move all the confusing dinode mapping code that's split between xchk_iallocbt_check_cluster and xchk_iallocbt_check_cluster_ifree into the first function so that it's clearer how we find the dinode for a given inode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Teach scrub how to handle the case that there are one or more inobt records covering a given inode cluster. This fixes the operation on big block filesystems (e.g. 64k blocks, 512 byte inodes). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
The code to check inobt records against inode clusters is a mess of poorly named variables and unnecessary parameters. Clean the unnecessary inode number parameters out of _check_cluster_freemask in favor of computing them inside the function instead of making the caller do it. In xchk_iallocbt_check_cluster, rename the variables to make it more obvious just what chunk_ino and cluster_ino represent. Add a tracepoint to make it easier to track each inode cluster as we scrub it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Hoist the inode cluster checks out of the inobt record check loop into a separate function in preparation for refactoring of that loop. No functional changes here; that's in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
On a big block filesystem, there may be multiple inobt records covering a single inode cluster. These records obviously won't be aligned to cluster alignment rules, and they must cover the entire cluster. Teach scrub to check for these things. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
In xchk_iallocbt_rec, check the alignment of ir_startino by converting the inode cluster block alignment into units of inodes instead of the other way around (converting ir_startino to blocks). This prevents us from tripping over off-by-one errors in ir_startino which are obscured by the inode -> block conversion. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Make sure we never check more than XFS_INODES_PER_CHUNK inodes for any given inobt record since there can be more than one inobt record mapped to an inode cluster. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
- 10 Feb, 2019 7 commits
-
-
Linus Torvalds authored
-
git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds authored
Pull dmaengine fixes from Vinod Koul: - Fix in at_xdmac fr wrongful channel state - Fix for imx driver for wrong callback invocation - Fix to bcm driver for interrupt race & transaction abort. - Fix in dmatest to abort in mapping error * tag 'dmaengine-fix-5.0-rc6' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: dmatest: Abort test in case of mapping error dmaengine: bcm2835: Fix abort of transactions dmaengine: bcm2835: Fix interrupt race on RT dmaengine: imx-dma: fix wrong callback invoke dmaengine: at_xdmac: Fix wrongfull report of a channel as in use
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull x86 fixes from Ingo Molnar: "A handful of fixes: - Fix an MCE corner case bug/crash found via MCE injection testing - Fix 5-level paging boot crash - Fix MCE recovery cache invalidation bug - Fix regression on Xen guests caused by a recent PMD level mremap speedup optimization" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Make set_pmd_at() paravirt aware x86/mm/cpa: Fix set_mce_nospec() x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull irq fixes from Ingo Molnar: "irqchip driver fixes: most of them are race fixes for ARM GIC (General Interrupt Controller) variants, but also a fix for the ARM MMP (Marvell PXA168 et al) irqchip affecting OLPC keyboards" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Fix ITT_entry_size accessor irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable irqchip/gic-v3-its: Gracefully fail on LPI exhaustion irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID irqchip/gic-v4: Fix occasional VLPI drop
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull perf fixes from Ingo Molnar: "A couple of kernel side fixes: - Fix the Intel uncore driver on certain hardware configurations - Fix a CPU hotplug related memory allocation bug - Remove a spurious WARN() ... plus also a handful of perf tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf script python: Add Python3 support to tests/attr.py perf trace: Support multiple "vfs_getname" probes perf symbols: Filter out hidden symbols from labels perf symbols: Add fallback definitions for GELF_ST_VISIBILITY() tools headers uapi: Sync linux/in.h copy from the kernel sources perf clang: Do not use 'return std::move(something)' perf mem/c2c: Fix perf_mem_events to support powerpc perf tests evsel-tp-sched: Fix bitwise operator perf/core: Don't WARN() for impossible ring-buffer sizes perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu() perf/x86/intel/uncore: Add Node ID mask
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull locking fixes from Ingo Molnar: "An rtmutex (PI-futex) deadlock scenario fix, plus a locking documentation fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Handle early deadlock return correctly futex: Fix barrier comment
-
Juergen Gross authored
set_pmd_at() calls native_set_pmd() unconditionally on x86. This was fine as long as only huge page entries were written via set_pmd_at(), as Xen pv guests don't support those. Commit 2c91bd4a ("mm: speed up mremap by 20x on large regions") introduced a usage of set_pmd_at() possible on pv guests, leading to failures like: BUG: unable to handle kernel paging request at ffff888023e26778 #PF error: [PROT] [WRITE] RIP: e030:move_page_tables+0x7c1/0xae0 move_vma.isra.3+0xd1/0x2d0 __se_sys_mremap+0x3c6/0x5b0 do_syscall_64+0x49/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Make set_pmd_at() paravirt aware by just letting it use set_pmd(). Fixes: 2c91bd4a ("mm: speed up mremap by 20x on large regions") Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: boris.ostrovsky@oracle.com Cc: sstabellini@kernel.org Cc: hpa@zytor.com Cc: bp@alien8.de Cc: torvalds@linux-foundation.org Link: https://lkml.kernel.org/r/20190210074056.11842-1-jgross@suse.com
-
- 09 Feb, 2019 3 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linuxLinus Torvalds authored
Pull i2c fixes from Wolfram Sang: "One PM related driver bugfix and a MAINTAINERS update" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: MAINTAINERS: Update the ocores i2c bus driver maintainer, etc i2c: omap: Use noirq system sleep pm ops to idle device for suspend
-
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linuxLinus Torvalds authored
Pull MIPS fixes from Paul Burton: "A batch of MIPS fixes for 5.0, nothing too scary. - A workaround for a Loongson 3 CPU bug is the biggest change, but still fairly straightforward. It adds extra memory barriers (sync instructions) around atomics to avoid a CPU bug that can break atomicity. - Loongson64 also sees a fix for powering off some systems which would incorrectly reboot rather than waiting for the power down sequence to complete. - We have DT fixes for the Ingenic JZ4740 SoC & the JZ4780-based Ci20 board, and a DT warning fix for the Nexsys4/MIPSfpga board. - The Cavium Octeon platform sees a further fix to the behaviour of the pcie_disable command line argument that was introduced in v3.3. - The VDSO, introduced in v4.4, sees build fixes for configurations of GCC that were built using the --with-fp-32= flag to specify a default 32-bit floating point ABI. - get_frame_info() sees a fix for configurations with CONFIG_KALLSYMS=n, for which it previously always returned an error. - If the MIPS Coherence Manager (CM) reports an error then we'll now clear that error correctly so that the GCR_ERROR_CAUSE register will be updated with information about any future errors" * tag 'mips_fixes_5.0_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: mips: cm: reprime error cause mips: loongson64: remove unreachable(), fix loongson_poweroff(). MIPS: Remove function size check in get_frame_info() MIPS: Use lower case for addresses in nexys4ddr.dts MIPS: Loongson: Introduce and use loongson_llsc_mb() MIPS: VDSO: Include $(ccflags-vdso) in o32,n32 .lds builds MIPS: VDSO: Use same -m%-float cflag as the kernel proper MIPS: OCTEON: don't set octeon_dma_bar_type if PCI is disabled DTS: CI20: Fix bugs in ci20's device tree. MIPS: DTS: jz4740: Correct interrupt number of DMA core
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull block fixes from Jens Axboe: - NVMe pull request from Christoph, fixing namespace locking when dealing with the effects log, and a rapid add/remove issue (Keith) - blktrace tweak, ensuring requests with -1 sectors are shown (Jan) - link power management quirk for a Smasung SSD (Hans) - m68k nfblock dynamic major number fix (Chengguang) - series fixing blk-iolatency inflight counter issue (Liu) - ensure that we clear ->private when setting up the aio kiocb (Mike) - __find_get_block_slow() rate limit print (Tetsuo) * tag 'for-linus-20190209' of git://git.kernel.dk/linux-block: blk-mq: remove duplicated definition of blk_mq_freeze_queue Blk-iolatency: warn on negative inflight IO counter blk-iolatency: fix IO hang due to negative inflight counter blktrace: Show requests without sector fs: ratelimit __find_get_block_slow() failure message. m68k: set proper major_num when specifying module param major_num libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD nvme-pci: fix rapid add remove sequence nvme: lock NS list changes while handling command effects aio: initialize kiocb private in case any filesystems expect it.
-