- 12 Apr, 2004 40 commits
-
-
Andrew Morton authored
Juggle dirty pages and dirty inodes and dirty superblocks and various different writeback modes and livelock avoidance and fairness to recover from the loss of mapping->io_pages.
-
Andrew Morton authored
Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY tag. Remove address_space.dirty_pages.
-
Andrew Morton authored
Arrange for under-writeback pages to be marked thus in their pagecache radix tree.
-
Andrew Morton authored
Arrange for all dirty pagecache pages to be tagged as dirty within their radix tree.
-
Andrew Morton authored
Intro to these patches:

- Major surgery against the pagecache, radix-tree and writeback code. This work is to address the O_DIRECT-vs-buffered data exposure horrors which we've been struggling with for months.

As a side-effect, 32 bytes are saved from struct inode and eight bytes are removed from struct page, at a cost of approximately 2.5 bits per page in the radix tree nodes on 4k pagesize, assuming the pagecache is densely populated. Not all pages are pagecache; other pages gain the full 8 byte saving.

This change will break any arch code which is using page->list and will also break any arch code which is using page->lru of memory which was obtained from slab.

The basic problem which we (mainly Daniel McNeil) have been struggling with is in getting a really reliable fsync() across the page lists while other processes are performing writeback against the same file. It's like juggling four bars of wet soap with your eyes shut while someone is whacking you with a baseball bat. Daniel pretty much has the problem plugged but I suspect that's just because we don't have testcases to trigger the remaining problems. The complexity and additional locking which those patches add is worrisome.

So the approach taken here is to remove the page lists altogether and replace the list-based writeback and wait operations with in-order radix-tree walks.

The radix-tree code has been enhanced to support "tagging" of pages, for later searches for pages which have a particular tag set. This means that we can ask the radix tree code "find me the next 16 dirty pages starting at pagecache index N" and it will do that in O(log64(N)) time.

This affects I/O scheduling potentially quite significantly. It is no longer the case that the kernel will submit pages for I/O in the order in which the application dirtied them. We instead submit them in file-offset order all the time. This is likely to be advantageous when applications are seeking all over a large file randomly writing small amounts of data. I haven't performed much benchmarking, but tiobench random write throughput seems to be increased by 30%. Other tests appear to be unaltered. dbench may have got 10-20% quicker, but it's variable.

There is one large file which everyone seeks all over randomly writing small amounts of data: the blockdev mapping which caches filesystem metadata. The kernel's IO submission patterns for this are now ideal.

Because writeback and wait-for-writeback use a tree walk instead of a list walk they are no longer livelockable. This probably means that we no longer need to hold i_sem across O_SYNC writes and perhaps fsync() and fdatasync(). This may be beneficial for databases: multiple processes writing and syncing different parts of the same file at the same time can now all submit and wait upon writes to just their own little bit of the file, so we can get a lot more data into the queues.

It is trivial to implement a part-file-fdatasync() as well, so applications can say "sync the file from byte N to byte M", and multiple applications can do this concurrently. This is easy for ext2 filesystems, but probably needs lots of work for data-journalled filesystems and XFS and it probably doesn't offer much benefit over an i_semless O_SYNC write.

These patches can end up making ext3 (even) slower:

    for i in 1 2 3 4
    do
        dd if=/dev/zero of=$i bs=1M count=2000 &
    done

runs awfully slow on SMP. This is, yet again, because all the file blocks are jumbled up and the per-file linear writeout causes tons of seeking. The above test runs sweetly on UP because on UP we don't allocate blocks to different files in parallel.

Mingming and Badari are working on getting block reservation working for ext3 (preallocation on steroids). That should fix ext3 up.

This patch:

- Later, we'll need to access the radix trees from inside disk I/O completion handlers. So make mapping->page_lock irq-safe. And rename it to tree_lock to reliably break any missed conversions.
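The "sync the file from byte N to byte M" idea and the switch to file-offset ordering can be illustrated with a small userspace model. This is only a sketch, not kernel code: it assumes 4k pages, stands in a sorted Python list for the radix tree, and `dirty_pages_in_range` is a hypothetical helper name.

```python
import bisect

PAGE_SIZE = 4096  # assumption: 4k pages, as in the description above


def submission_order(dirtied):
    """Model of tree-based writeback: pages go out in file-offset order,
    regardless of the order in which the application dirtied them."""
    return sorted(set(dirtied))


def dirty_pages_in_range(dirty_indices, first_byte, last_byte):
    """Return the dirty page indices overlapping bytes
    [first_byte, last_byte] (inclusive), in ascending order.

    dirty_indices must be sorted; a sorted list stands in for the
    in-order radix-tree walk here.
    """
    first_page = first_byte // PAGE_SIZE
    last_page = last_byte // PAGE_SIZE
    lo = bisect.bisect_left(dirty_indices, first_page)
    hi = bisect.bisect_right(dirty_indices, last_page)
    return dirty_indices[lo:hi]


# The application dirtied pages in this order...
dirtied = [900, 3, 512, 4]
# ...but writeback would submit them by file offset.
ordered = submission_order(dirtied)
# A part-file sync of the first 16k only touches pages 0..3.
partial = dirty_pages_in_range(ordered, 0, 4 * PAGE_SIZE - 1)
```

Two callers syncing disjoint byte ranges select disjoint page sets in this model, which is the property that lets them proceed concurrently.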
-
Andrew Morton authored
Add radix-tree tagging so we can look up dirty or writeback pages in O(log64(n)) time. Each radix-tree node gains two bits for each slot: one for page dirtiness and one for page writebackness. If a tag bit is set on a leaf node, it indicates that item at the corresponding slot is tagged (say, a dirty page). If a tag bit is set in a non-leaf node it indicates that the same tag bit is set in the subtree which lies under the corresponding slot. ie: "there is a dirty page under here somewhere, but you need to search down further to find it". A gang lookup function is provided which can walk the radix tree in logarithmic time looking for items which are tagged, starting from a specified offset. We use this for in-order searches for dirty or writeback pages. There is a userspace test harness for this code at http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz
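The tagging scheme described above can be modelled in a few lines of userspace Python. This is a rough sketch under stated assumptions, not the kernel's radix-tree code: the tree has a fixed height of 3 (the real tree grows on demand), all class and method names are illustrative, and the per-node scan checks each of the 64 tag bits in turn rather than using a find-next-bit primitive.

```python
RADIX_SHIFT = 6                  # 64 slots per node, hence log64
RADIX_SIZE = 1 << RADIX_SHIFT
RADIX_MASK = RADIX_SIZE - 1
HEIGHT = 3                       # assumption: fixed height, indices 0 .. 64**3 - 1
TAG_DIRTY, TAG_WRITEBACK = 0, 1  # two tag bits per slot


class Node:
    def __init__(self):
        self.slots = [None] * RADIX_SIZE
        self.tags = [0, 0]       # one bitmask per tag, one bit per slot


class TaggedRadixTree:
    def __init__(self):
        self.root = Node()

    def _path(self, index, create=False):
        """Return [(node, slot_offset), ...] from root to leaf, or None."""
        assert 0 <= index < RADIX_SIZE ** HEIGHT
        path, node = [], self.root
        for level in range(HEIGHT - 1, -1, -1):
            offset = (index >> (level * RADIX_SHIFT)) & RADIX_MASK
            path.append((node, offset))
            if level == 0:
                break
            child = node.slots[offset]
            if child is None:
                if not create:
                    return None
                child = node.slots[offset] = Node()
            node = child
        return path

    def insert(self, index, item):
        node, offset = self._path(index, create=True)[-1]
        node.slots[offset] = item

    def tag_set(self, index, tag):
        path = self._path(index)
        if path is None or path[-1][0].slots[path[-1][1]] is None:
            raise KeyError(index)
        # Set the bit at every level: "there is a tagged item under here".
        for node, offset in path:
            node.tags[tag] |= 1 << offset

    def tag_clear(self, index, tag):
        path = self._path(index)
        if path is None:
            return
        # Clear bottom-up; stop once a node still has another tagged slot.
        for node, offset in reversed(path):
            node.tags[tag] &= ~(1 << offset)
            if node.tags[tag]:
                break

    def gang_lookup_tag(self, start, max_items, tag):
        """Up to max_items (index, item) pairs with tag set, index >= start,
        in ascending index order."""
        out = []
        self._walk(self.root, HEIGHT - 1, 0, start, max_items, tag, out)
        return out

    def _walk(self, node, level, base, start, max_items, tag, out):
        shift = level * RADIX_SHIFT
        for offset in range(RADIX_SIZE):
            if len(out) >= max_items:
                return
            if not node.tags[tag] & (1 << offset):
                continue                      # nothing tagged below this slot
            lo = base + (offset << shift)
            hi = lo + (1 << shift) - 1
            if hi < start:
                continue                      # subtree lies before start
            if level == 0:
                out.append((lo, node.slots[offset]))
            else:
                self._walk(node.slots[offset], level - 1, lo,
                           start, max_items, tag, out)
```

With pages inserted at indices 3, 70 and 200000 and all three tagged dirty, `gang_lookup_tag(4, 1, TAG_DIRTY)` skips the subtree holding index 3 and returns only the entry at index 70. Untagged subtrees are never descended into, which is where the logarithmic behaviour comes from.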
-
Andrew Morton authored
This function is setting page->mapping = swapper_space, but isn't actually adding the page to swapcache. This triggers soon-to-be-added BUGs in the radix tree code. So temporarily add these pages to swapcache for real. Also, make rw_swap_page_sync() go away if it has no callers.
-
Andrew Morton authored
From: Suparna Bhattacharya <suparna@in.ibm.com>, Daniel McNeil <daniel@osdl.org>

This patch ensures that when the DIO code falls back to buffered i/o after having submitted part of the i/o, then buffered i/o is issued only for the remaining part of the request (i.e. the part not already covered by DIO), rather than redoing the entire i/o. Now, instead of returning written == -ENOTBLK, generic_file_direct_IO returns the number of bytes already handled by DIO, so that the caller knows how much of the I/O is left to be handled via fallback to buffered write.

We need to be careful not to access dio fields if it's possible that the dio could already have been freed asynchronously during i/o completion. A tricky part of this involves plugging the window between the decrement of bio_count and accessing dio->waiter during i/o completion where the dio could get freed by the submission path. This potential "bio_count race" was tackled (by Daniel) by changing bio_list_lock into bio_lock and using that for all the bio fields. Now bio_count and bios_in_flight have been converted from atomics into int and are both protected by the bio_lock. The race in finished_one_bio() could thus be fixed by leaving the bio_count at 1 until after the dio_complete() and then doing the bio_count decrement and wakeup holding the bio_lock. It appears that shifting to the spin_lock instead of atomic_inc/decs is ok performance wise as well.

Update: An AIO O_DIRECT request was extending the file so it was done synchronously. However, the request got an EFAULT and direct_io_worker() was calling aio_complete() on the iocb and returning the EFAULT. When io_submit_one() got the EFAULT return, it assumed it had to call aio_complete() since the i/o never got queued. The fix is for direct_io_worker() to only call aio_complete() when the upper layer is going to return -EIOCBQUEUED and not when getting errors that are being returned to the submit path.
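The fallback behaviour described above (buffered i/o issued only for the part DIO did not cover) can be sketched with a toy file model. Everything here is an assumption made for illustration: the 4-byte block size, the `ToyFile` class and the rule that the direct path stops at the first hole. None of it is the kernel's actual interface.

```python
BLOCK = 4  # assumption: tiny block size to keep the example readable


class ToyFile:
    def __init__(self, nblocks, holes):
        self.data = bytearray(nblocks * BLOCK)
        self.holes = set(holes)   # block numbers with no allocated storage
        self.log = []             # records which path handled which bytes

    def direct_write(self, buf, pos):
        """Write block by block until a hole is reached.
        Returns the number of bytes handled, possibly fewer than len(buf)."""
        done = 0
        while done < len(buf):
            if (pos + done) // BLOCK in self.holes:
                break             # the direct path will not fill in holes
            n = min(len(buf) - done, BLOCK - (pos + done) % BLOCK)
            self.data[pos + done:pos + done + n] = buf[done:done + n]
            done += n
        self.log.append(("dio", pos, done))
        return done

    def buffered_write(self, buf, pos):
        """Buffered path: allocates any holes it touches, writes everything."""
        for block in range(pos // BLOCK, (pos + len(buf) - 1) // BLOCK + 1):
            self.holes.discard(block)
        self.data[pos:pos + len(buf)] = buf
        self.log.append(("buffered", pos, len(buf)))
        return len(buf)


def write(f, buf, pos):
    """Try direct i/o first; hand only the unhandled remainder to the
    buffered path instead of redoing the entire request."""
    written = f.direct_write(buf, pos)
    if written < len(buf):
        written += f.buffered_write(buf[written:], pos + written)
    return written
```

For a 4-block file with a hole at block 2, a 12-byte write at offset 0 is logged as a direct write of 8 bytes followed by a buffered write of the last 4 bytes at offset 8.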
-
Andrew Morton authored
From: Suparna Bhattacharya <suparna@in.ibm.com>

Fixes the following remaining issues with the DIO code:

1. During DIO file extends, intermediate writes could extend i_size exposing unwritten blocks to intermediate reads (Soln: Don't drop i_sem for file extends)

2. AIO-DIO file extends may update i_size before I/O completes, exposing unwritten blocks to intermediate reads. (Soln: Force AIO-DIO file extends to be synchronous)

3. AIO-DIO writes to holes call aio_complete() before falling back to buffered I/O! (Soln: Avoid calling aio_complete() if -ENOTBLK)

4. AIO-DIO writes to an allocated region followed by a hole fall back to buffered i/o without waiting for already submitted i/o to complete; they might return to user-space, which could overwrite the buffer contents while they are still being written out by the kernel (Soln: Always wait for submitted i/o to complete before falling back to buffered i/o)
-
Andrew Morton authored
From: Badari Pulavarty <pbadari@us.ibm.com> 1) blkdev_direct_IO() calls blockdev_direct_IO() instead of blockdev_direct_IO_no_locking(). 2) writev entry point is generic_file_writev() which grabs i_sem. It should use generic_file_write_nolock() instead.
-
Andrew Morton authored
Fix a race which was identified by Daniel McNeil <daniel@osdl.org>.

If a buffer_head is under I/O due to JBD's ordered data writeout (which uses ll_rw_block()) then either filemap_fdatawrite() or filemap_fdatawait() needs to wait on the buffer's existing I/O. Presently neither will do so, because __block_write_full_page() will not actually submit any I/O and will hence not mark the page as being under writeback.

The best-performing fix would be to somehow mark the page as being under writeback and defer waiting for the ll_rw_block-initiated I/O until filemap_fdatawait()-time. But this is hard, because in __block_write_full_page() we do not have control of the buffer_head's end_io handler. Possibly we could make JBD call into end_buffer_async_write(), but that gets nasty.

This patch makes __block_write_full_page() wait for any buffer_head I/O to complete before inspecting the buffer_head state. It only does this in the case where __block_write_full_page() was called for a "data-integrity" write: (wbc->sync_mode != WB_SYNC_NONE). Probably it doesn't matter, because kjournald is currently submitting (or has already submitted) all dirty buffers anyway.
-
Andrew Morton authored
From: Badari Pulavarty, Suparna Bhattacharya, Andrew Morton

Forward port of Stephen Tweedie's DIO fixes from 2.4, to fix various DIO vs buffered IO exposures involving races causing:

(a) stale data from uninstantiated blocks to be read, e.g.
    - O_DIRECT reads against buffered writes to a sparse region
    - O_DIRECT writes to a sparse region against buffered reads

(b) potential data corruption with
    - O_DIRECT IOs against truncate due to writes to truncated blocks (which may have been reallocated to another file).

Summary of fixes:

1) All the changes affect only regular files. RAW/O_DIRECT on block devices are unaffected.

2) The DIO code will not fill in sparse regions on a write. Instead -ENOTBLK is returned and the generic file write code falls through to buffered IO in this case, followed by writing through the pages to disk using filemap_fdatawrite/wait.

3) i_sem is held during both DIO reads and writes. For reads, and writes to already allocated blocks, it is released right after IO is issued, while for writes to newly allocated blocks (e.g. file extending writes and hole overwrites) it is held all the way through until IO completes (and data is committed to disk).

4) filemap_fdatawrite/wait are called under i_sem to synchronize buffered pages to disk blocks before issuing DIO.

5) A new rwsem (i_alloc_sem) is held in shared mode all the while a DIO (read or write) is in progress, and in exclusive mode by truncate to guard against deallocation of data blocks during DIO.

6) All this new locking has been pushed down into blockdev_direct_IO to avoid interfering with NFS direct IO. The locks are taken in the order i_sem followed by i_alloc_sem. While i_sem may be released after IO submission in some cases, i_alloc_sem is held through until dio_complete (in the case of AIO-DIO this happens through the IO completion callback).

7) i_sem and i_alloc_sem are not held for the _nolock versions of write routines, as used by blockdev and XFS. Filesystems can specify the needs_special_locking parameter to __blockdev_direct_IO from their direct IO address space op accordingly.

Note from Badari: Here is the locking (when needs_special_locking is true):

(1) generic_file_*_write() holds i_sem (as before) and calls ->direct_IO(). blockdev_direct_IO gets i_alloc_sem and calls direct_io_worker().

(2) generic_file_*_read() does not hold any locks. blockdev_direct_IO() gets i_sem and then i_alloc_sem and calls direct_io_worker() to do the work.

(3) direct_io_worker() does the work and drops i_sem after submitting IOs if appropriate and drops i_alloc_sem after completing IOs.
-
Andrew Morton authored
From: Matt Mackall <mpm@selenic.com> From: Zwane Mwaikambo <zwane@arm.linux.org.uk> This enables deep powersaving mode on Geode boxes.
-
Andrew Morton authored
From: Matt Mackall <mpm@selenic.com> drop quota array in inode struct if no quota support
-
Andrew Morton authored
From: Matt Mackall <mpm@selenic.com> The nswap and cnswap counters have never been incremented, as Linux doesn't do task swapping.
-
Andrew Morton authored
From: Matt Mackall <mpm@selenic.com> Make CONFIG_EMBEDDED description more accurate
-
Andrew Morton authored
From: Christoph Hellwig <hch@lst.de> The maintainer doesn't respond, unfortunately, but removing these from net_devices unconditionally is the 2.6 way to go: there's no more module refcounting on net devices.
-
Andrew Morton authored
From: "Luiz Fernando N. Capitulino" <lcapitulino@prefeitura.sp.gov.br> sound/oss/wavfront.c: At top level: sound/oss/wavfront.c:2498: warning: `errno' defined but not used
-
Andrew Morton authored
From: Christoph Hellwig <hch@lst.de> Kill magic ide/sound makedev scripts in scripts/. The userland MAKEDEV is the proper place and already has support for them.
-
Andrew Morton authored
From: Martin Schwidefsky <schwidefsky@de.ibm.com> Just found a small bug in pgalloc for s390*. Comparing notes with other architectures I found that pte_alloc_one is sick for alpha and sparc64 as well.
-
Andrew Morton authored
From: Stephen Smalley <sds@epoch.ncsc.mil> This patch fixes the type of the ssec pointer in the sk_free_security function. This has no current impact as the magic element is the top of each structure. Thanks to Chad Hanson of TCS for discovering the bug and submitting the patch.
-
Andrew Morton authored
From: "Luiz Fernando N. Capitulino" <lcapitulino@prefeitura.sp.gov.br> drivers/media/dvb/frontends/stv0299.c:356: warning: unused variable `i'
-
Andrew Morton authored
From: "Nguyen, Tom L" <tom.l.nguyen@intel.com>

Adds MSI support for ia64.

- Modified existing code in drivers/pci/msi.c and drivers/pci/msi.h to include MSI support on IA64 platform.

- Based on the comments received from Zwane Mwaikambo and David Mosberger, this patch consolidates the vector allocators as assign_irq_vector(AUTO_ASSIGN) has the same semantics as ia64_alloc_vector() by converting the existing uses of ia64_alloc_vector() to assign_irq_vector(AUTO_ASSIGN).

- Based on the comments received from Zwane Mwaikambo, this patch consolidates the semantics of vector allocator assign_irq_vector() in drivers/pci/msi.c into the relevant architecture's vector allocator assign_irq_vector() in arch/i386/kernel/io_apic.c.

- Regarding vector allocation, this patch modifies the existing function assign_irq_vector() to maximize the number of allocated vectors to 188 before going -ENOSPC.

- Based on your comments, this patch creates <asm-i386/msi.h>, <asm-ia64/msi.h> and <asm-x86_64/msi.h>, includes <asm/msi.h> from within drivers/pci/msi.h and then places all the code which is currently under ifdef in msi.h into the relevant architecture's <asm/msi.h> file.

- Based on your comments, this patch places pci_vector_resources() in existing drivers/pci/msi.c in the relevant architecture implementations such as into arch/.../pci/irq.c.
-
Andrew Morton authored
From: James Cleverdon <jamesclv@us.ibm.com> Bump up MAX_MP_BUSSES for summit/generic subarch to cope with big IBM x440 systems.
-
Andrew Morton authored
From: James Cleverdon <jamesclv@us.ibm.com> Break out the definition of NR_IRQ_VECTORS, etc from irq_vectors.h into irq_vectors_limits.h, so we can change it per subarch without having code duplication for the rest of the file. Stick the same values back for mach-default, and override them for mach-summit/generic which needs bigger limits.
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> Agustin Martin <agmartin@debian.org> pointed out that this doesn't work: options ide-mod options="ide=nodma hdc=cdrom" The quotes are understood by kernel/params.c (ie. it skips over spaces inside them), but are not stripped before handing to the underlying function. They should be.
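The intended behaviour (spaces inside quotes do not split arguments, and the quotes are stripped before the value is handed on) can be sketched in userspace as follows. This is a minimal model, not the code in kernel/params.c; `split_params` and `parse_param` are hypothetical names.

```python
def split_params(line):
    """Split a parameter string on whitespace that is outside double quotes.
    The quotes themselves are kept at this stage."""
    args, cur, in_quote = [], [], False
    for ch in line:
        if ch == '"':
            in_quote = not in_quote
            cur.append(ch)
        elif ch.isspace() and not in_quote:
            if cur:
                args.append("".join(cur))
                cur = []
        else:
            cur.append(ch)
    if cur:
        args.append("".join(cur))
    return args


def parse_param(arg):
    """Split 'name=value' and strip one pair of surrounding quotes from the
    value, so the underlying handler never sees them.
    Returns (name, None) for a bare flag."""
    name, sep, value = arg.partition("=")
    if not sep:
        return name, None
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return name, value


args = split_params('options="ide=nodma hdc=cdrom" debug')
parsed = [parse_param(a) for a in args]
```

Here `parsed` holds `("options", "ide=nodma hdc=cdrom")` followed by `("debug", None)`: the embedded space survives and the quotes are gone.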
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Kevin P. Fleming pointed out that the 2.6 tmpfs does not allow writing huge sparse files. This is an unintended side-effect of the strict memory commit changes: which should make no difference. The solution is to treat the tmpfs files (of variable size) and the shmem objects (of fixed size) differently: sounds nasty but works out well. The shmem objects follow the VM preallocation convention as before, but the tmpfs files revert to allocation on demand as a filesystem would. If there's not enough memory to write to a tmpfs hole, it is reported as -ENOSPC rather than -ENOMEM, so the mmap writer gets SIGBUS rather than everyone else getting OOM-killed.
-
Andrew Morton authored
From: William Lee Irwin III <wli@holomorphy.com> Change bitmap_shift_left()/bitmap_shift_right() to have O(1) stackspace requirements. Given zeroed tail preconditions these implementations satisfy zeroed tail postconditions, which makes them compatible with whatever changes from Paul Jackson one may want to merge in the future. No particular effort was required to ensure this. A small (but hopefully forgivable) cleanup is a spelling correction: s/bitmap_shift_write/bitmap_shift_right/ in one of the kerneldoc comments. The primary effect of the patch is to remove the MAX_BITMAP_BITS limitation, thereby restoring NR_CPUS to be limited only by stackspace and slab allocator maximums. They also look vaguely more efficient than the current code, though as this was not done for performance reasons, no performance testing was done.
-
Andrew Morton authored
From: Marcelo Tosatti <marcelo.tosatti@cyclades.com> From: Alain Knaff <alain.knaff@lll.lu> This patch adds support for floppy disks whose sectors are numbered starting at 0 rather than 1 as usual disks would be. This format is used for some CP/M disks, and also for certain music samplers (such as Ensoniq EPS 16plus). In order to use it, you need an fdutils with the current patch from http://fdutils.linux.lu as well, and then do setfdprm /dev/fd0 dd zerobased sect=10 or setfdprm /dev/fd0 hd zerobased sect. In addition, the patch also fixes my email addresses. I no longer use pobox.com.
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> Brian Gerst's patch which moved __this_module out from module.h into the module post-processing had a side effect. genksyms didn't see the undefined symbols for modules without a module_init (or module_exit), and hence didn't generate a version for them, causing the kernel to be tainted. The simple solution is to always include the versions for these functions. Also includes two cleanups: 1) alloc_symbol is easier to use if it populates ->next for us. 2) add_exported_symbol should set owner to module, not head of module list (we don't use this field in entries in that list, fortunately).
-
Andrew Morton authored
From: Brian Gerst <bgerst@didntduck.org> Move the __this_module structure to the modpost code where it really belongs.
-
Andrew Morton authored
From: Eric Dumazet <dada1@cosmosbay.com> We can avoid evaluating `current' in a few places.
-
Andrew Morton authored
From: Heiko Ronsdorf <hero@persua.de> - Remove a volatile which causes a warning via module_param() - Remove an unused variable.
-
Andrew Morton authored
From: David Mosberger <davidm@napali.hpl.hp.com> Somebody recently pointed out a performance-anomaly to me where an unusual amount of time was being spent reading from /dev/urandom. The problem isn't really surprising as it happened only on >= 4-way machines and the random driver isn't terribly scalable the way it is written today. If scalability _really_ mattered, I suppose per-CPU data structures would be the way to go. However, I found that at least for 4-way machines, performance can be improved considerably with the attached patch. In particular, I saw the following performance on a 4-way ia64 machine: Test: 3 tasks running "dd if=/dev/urandom of=/dev/null bs=1024": throughput:
-
Andrew Morton authored
From: Mike Waychison <Michael.Waychison@Sun.COM> Export complete_all for module use.
-
Andrew Morton authored
From: Arjan van de Ven <arjanv@redhat.com> The patch below adds a few missing put_user()'s to the i810/i830 drm modules. Users reported oopses with 4g/4g split in action, and sparse annotations indeed found the offender in the function in question. I've kept the sparse __user annotations since those are generally useful anyway. I can't test it myself but a few people reported that the oopses went away so far.
-
Andrew Morton authored
From: Trivial Patch Monkey <trivial@rustcorp.com.au> From: Thomas Molina <tmolina@cablespeed.com>
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> These macros are redefined here. The previous definitions are in asm-ppc(64)/io.h
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> From: maximilian attems <janitor@sternwelten.at> Add the Monkey to SubmittingPatches.
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> From: colpatch@us.ibm.com The cpu_2_node[] array for i386 is initialized to all 0's, meaning that until modified at CPU bring-up, all CPUs are mapped to node 0. When CPUs are brought online, they are mapped to the appropriate node by various mechanisms, depending on the underlying hardware. When we unmap CPUs (hotplug time), we should return the mapping for the CPU that is going away to its original state, ie: 0. When this code was initially submitted, the misguided poster (me) made the mistake of putting a -1 in the cpu_2_node[] array for the CPU going away. This patch fixes this mistake, and allows code to get a valid node number for all valid CPU numbers. This is important, because most (if not all) callers do not error check the value returned by the cpu_to_node() macro, and they should not have to. The API specifies that a valid node number be returned for any valid CPU number.
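The invariant being restored (every valid CPU number maps to a valid node number, even after the CPU has gone away) can be shown with a toy model. The array size and the helper names are assumptions for illustration only, not the i386 code.

```python
NR_CPUS = 8                    # assumption: small CPU count for the example
NR_NODES = 4                   # assumption

cpu_2_node = [0] * NR_CPUS     # initial state: every CPU maps to node 0


def map_cpu_to_node(cpu, node):
    cpu_2_node[cpu] = node


def unmap_cpu_to_node(cpu):
    # Return the mapping to its original state (0), not to -1:
    # callers index per-node data with the result and do not error check.
    cpu_2_node[cpu] = 0


def cpu_to_node(cpu):
    return cpu_2_node[cpu]


# A caller that trusts cpu_to_node(), as most callers do.
node_load = [0] * NR_NODES


def account(cpu):
    node_load[cpu_to_node(cpu)] += 1


map_cpu_to_node(3, 2)
account(3)                     # charged to node 2
unmap_cpu_to_node(3)
account(3)                     # still a valid index: charged to node 0
```

Had the unmapped entry been -1, the unchecked index in `account()` would have hit the wrong element (the last one in Python, out of bounds in C), which is the class of bug the fix rules out.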
-