Commits · 734db6899abd5a278c3826a301c49c57da3aea62 · Kirill Smelkov / linux

18 Jun, 2004 4 commits

[PATCH] reiserfs: block allocator optimizations · 734db689

Chris Mason authored 20 years ago

From: <mason@suse.com>
From: <jeffm@suse.com>

The current reiserfs allocator pretty much allocates things sequentially
from the start of the disk; it works very nicely for desktop loads, but
once you've got more than one proc doing io, data files can fragment badly.

One obvious solution is something like ext2's bitmap groups, which puts
file data into different areas of the disk based on which subdirectory
they are in.  The problem with bitmap groups is that if you've got a
group of subdirectories their contents will be spread out all over the
disk, leading to lots of seeks during a sequential read.

This allocator patch uses the packing locality to determine which bitmap
group to allocate from, but when you create a file it looks in the bitmaps
to see how 'full' that packing locality already is.  If it hasn't been
heavily used yet, the packing locality is inherited from the parent
directory, putting files in new subdirs close to the parent subdir;
otherwise it is the inode number of the parent directory, putting new
files far away from the parent subdir.
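The locality decision above can be sketched in C. This is a simplified illustration, not the reiserfs code. The struct, the function name and the 60% "heavily used" threshold are all hypothetical, and the real patch may measure fullness differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical threshold: a locality counts as "heavily used" once more
 * than this percentage of its bitmap group is allocated. */
#define LOCALITY_FULL_PERCENT 60u

/* Simplified (hypothetical) view of the bitmap group a locality maps to. */
struct bitmap_group {
    uint32_t total_blocks;
    uint32_t used_blocks;
};

/* Pick the packing locality for a new subdirectory's contents.
 * parent_locality: packing locality of the parent directory.
 * parent_objectid: inode (object) id of the parent directory.
 * group:           bitmap group that parent_locality maps to. */
static uint32_t choose_packing_locality(uint32_t parent_locality,
                                        uint32_t parent_objectid,
                                        const struct bitmap_group *group)
{
    uint64_t used_pct;

    assert(group->total_blocks > 0);
    /* 64-bit arithmetic so the multiplication cannot overflow. */
    used_pct = (uint64_t)group->used_blocks * 100u / group->total_blocks;

    if (used_pct <= LOCALITY_FULL_PERCENT)
        return parent_locality;   /* lightly used: stay near the parent */
    return parent_objectid;       /* heavily used: start a new locality */
}
```

Inheriting the parent's locality keeps related subdirs clustered. Switching to the parent's objectid once the group fills stops a single locality from growing without bound.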

The end result is fewer bitmap groups for the same working set.  For
example, one test data set created by 20 procs running in parallel has
6822 subdirs.  With vanilla reiserfs, that would mean 6822
packing localities.  This patch turns that into 26 packing localities.

This makes sequential reads of big directory trees more efficient, but
it also makes the btree more efficient in general.  Things end up sorted
better because groups of subdirs end up with similar keys in the btree,
instead of being spread out all over.

The bitmap grouping code tries to use the start of each bitmap group
for metadata, and offsets the data slightly.  The data and metadata
are still close together, but not completely intermixed like they are
in the default allocator.  The end result is that leaf nodes tend to be
close to each other, making metadata readahead more effective.
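The sketch below shows one way a search hint could separate metadata from data inside a group. The patch only says data is offset "slightly". The 64-block offset and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical gap, in blocks, between the start of a bitmap group
 * (reserved for metadata) and the point where data searches begin. */
#define DATA_OFFSET_BLOCKS 64u

/* Return the block at which to start searching for free space in bitmap
 * group 'group', where each group covers blocks_per_group blocks. */
static uint64_t search_start_hint(uint32_t group, uint32_t blocks_per_group,
                                  int is_metadata)
{
    uint64_t group_start;

    assert(blocks_per_group > DATA_OFFSET_BLOCKS);
    group_start = (uint64_t)group * blocks_per_group;
    return is_metadata ? group_start : group_start + DATA_OFFSET_BLOCKS;
}
```

For example, with 4 KiB blocks one bitmap block covers 4096 * 8 = 32768 blocks. Group 2 therefore starts at block 65536, and in this sketch data searches would begin at block 65600.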

The old block allocator had the ability to enforce a minimum
allocation size, but did not use it.  It now tries to do a pass looking
for larger allocation chunks before falling back to the old behaviour
of taking any blocks it can find.
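The two-pass search described above might look like this over a plain in-memory bitmap, where a set bit means the block is in use. The function names and the byte-array bitmap are assumptions made for the illustration; the real allocator scans on-disk bitmap blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the first run of at least min_len free blocks (0 bits) in
 * bits[0..nbits). Returns the start index, or -1 if none exists. */
static long find_free_run(const uint8_t *bits, size_t nbits, size_t min_len)
{
    size_t run = 0;

    for (size_t i = 0; i < nbits; i++) {
        int used = (bits[i / 8] >> (i % 8)) & 1;
        run = used ? 0 : run + 1;
        if (run >= min_len)
            return (long)(i + 1 - run);
    }
    return -1;
}

/* Pass 1 insists on a chunk of min_chunk blocks; pass 2 falls back to
 * any single free block (the old behaviour). */
static long alloc_start(const uint8_t *bits, size_t nbits, size_t min_chunk)
{
    long start;

    assert(min_chunk >= 1);
    start = find_free_run(bits, nbits, min_chunk);
    if (start < 0)
        start = find_free_run(bits, nbits, 1);
    return start;
}
```

Take the bitmap bytes {0x55, 0x0F}. A request for a 4-block chunk finds blocks 12-15. A request for 8 blocks fails the first pass and falls back to the first free block, 1.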

The patch changes the defaults to:

mount -o alloc=skip_busy:dirid_groups:packing_groups

You can get back the old behaviour with mount -o alloc=skip_busy

mount -o alloc=dirid_groups will turn on the bitmap groups
mount -o alloc=packing_groups turns on the packing locality reduction code
mount -o alloc=skip_busy:dirid_groups turns on both dirid_groups and
skip_busy

Finally the patch adds a mount -o alloc=oid_groups, which puts files into
bitmap groups based on a hash of their objectid.  This would be used for
databases or other situations where you have a limited number of very
large files.
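The sketch below shows objectid-based group selection. The patch does not say which hash it uses. The MurmurHash3 "fmix32" finalizer here is a stand-in, chosen only because it spreads consecutive ids well.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit integer mixer (MurmurHash3 fmix32 finalizer); a stand-in for
 * whatever hash the patch actually applies. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Choose a bitmap group from the objectid alone, so large files are spread
 * across the disk independently of the directory they live in. */
static uint32_t oid_to_group(uint32_t objectid, uint32_t nr_groups)
{
    assert(nr_groups > 0);
    return mix32(objectid) % nr_groups;
}
```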

This command will tell you how many packing localities are actually in
use:

debugreiserfs -d /dev/xxx | grep '^|.*SD' | sed 's/^.....//' | awk '{print $1}' | sort -u | wc -l
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Clean up asm/pgalloc.h include · 1c60f076

Russell King authored 20 years ago


This patch cleans up needless includes of asm/pgalloc.h from the fs/,
kernel/ and mm/ subtrees.  It has been compile tested on multiple ARM
platforms and on x86, and appears safe.

This patch is part of a larger patch aiming towards getting the include of
asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at
things like mm_struct and friends.

I suggest testing in -mm for a while to ensure there aren't any hidden arch
issues.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] binfmt_misc: improve calculation of interpreter's credentials · c407c033

Yoav Zach authored 20 years ago


This patch allows misc binaries to run with credentials and a security
token calculated according to the binary, rather than according to the
interpreter, which is the legacy behavior of binfmt_misc.

The way it is done is by calling prepare_binprm, which is where these
attributes are calculated, before switching the 'file' field in the bprm from
the binary to the interpreter.

This feature should be used with care, since the interpreter will have root
permissions when running a setuid binary owned by root.

Please note -

- Only root can register an interpreter with binfmt_misc.  The feature is
  documented and the administrator is advised to handle it with care

- The new feature is enabled only with a special flag in the registration
  string.  When this flag is not specified the current behavior of
  binfmt_misc is kept

- This is the only 'right' way for an interpreter to know the correct
  AT_SECURE value for the interpreted binary


From: Chris Wright <chrisw@osdl.org>

  This patchset looks OK, except for one problem.  It installs the fd (which
  could've been unreadable) without unsharing the ->files.  So someone can use
  this to read unreadable yet executable files.  Here's a patch which fixes
  that up.  I added one bit that's commented out because I'm not positive if a
  final steal_locks() is needed.

  I did a fair amount of rearranging to simplify the error conditions
  relative to the fd_install(), and unshare_files().

From: Chris Wright <chrisw@osdl.org>

  I found that the Intel patchset (and mine as well) leaked i_writecount on
  the original executed file.  In addition, I verified that the steal_locks()
  bit is indeed needed.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Handle non-readable binfmt_misc executables · 79baf43b

Yoav Zach authored 20 years ago

<background>

I work in a group that works on enabling the IA-32 Execution Layer
(http://www.intel.com/pressroom/archive/releases/20040113comp.htm) on Linux.
In a few words - this is a dynamic translator for IA-32 binaries on IPF
platform.  Following David Mosberger's advice - we use the binfmt_misc
mechanism for the invocation of the translator whenever the user tries to
exec an IA-32 binary.

The EL is meant to help in the migration path from IA-32 to IPF.  From our
beta customers we learnt that, at first, they tend to keep their
environment mostly intact, using the legacy IA-32 binaries.

Such an environment naturally has setuid and non-readable binaries.  It
would be useless to ask the administrator to change the settings of such an
environment: some of them are very complex, and the administrators are
reluctant to make any changes in a system that has already proved itself
robust and secure.  So our target with these patches is not to enhance the
support for scripts, but rather to allow a translator to be integrated into a
working environment that is not (and should not be) aware of the fact that
it's being emulated.

As I said before, it is practically hopeless to expect an administrator of
such a system to change it so that it will suit the current behavior of
binfmt_misc.  But even if we could do that, I'm not sure it would be a good
idea, since these changes are likely to be less secure than the suggested
patches:

- In order to execute non-readable binaries the binary will have to be made
  readable, which is obviously less secure than allowing only a trusted
  translator to read it

- There will be no way for the translator to calculate the accurate
  AT_SECURE value for the translated process.  This might end up with the
  translated process running in a non-secured mode when it actually needs to
  be secured.

</background>


I prepared a patch that solves a couple of problems that interpreters have
when invoked via binfmt_misc.  Currently:

1) such interpreters cannot open non-readable binaries

2) the processes will have their credentials and security attributes
   calculated according to interpreter permissions and not those of the
   original binary

The proposed patch solves these problems by:

1) opening the binary on behalf of the interpreter and passing its fd
   instead of the path as argv[1] to the interpreter

2) calling prepare_binprm with the file struct of the binary and not the
   one of the interpreter
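On the interpreter side, the description above means receiving a descriptor number in argv[1] instead of a path. The sketch below validates that argument and reads through the descriptor. It assumes exactly that calling convention, which should be checked against the binfmt_misc documentation for the kernel in use. The helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse argv[1] as a descriptor number (assumed convention). Returns the
 * fd, or -1 if the argument is not a non-negative decimal integer. */
static int fd_from_arg(const char *arg)
{
    char *end;
    long v;

    assert(arg != NULL);
    errno = 0;
    v = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0' || v < 0 || v > INT_MAX)
        return -1;
    return (int)v;
}

/* Read the start of the binary (e.g. its magic number) through the
 * descriptor without moving its file offset. The kernel opened the file,
 * so no permission check on the path is needed here. */
static ssize_t read_header(int fd, void *buf, size_t len)
{
    return pread(fd, buf, len, 0);
}
```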

The new functionality is enabled by adding a special flag to the registration
string.  If this flag is not added, the old behavior is unchanged.

A preliminary version of this patch was sent to the list on 9/1/2003 with the
title "[PATCH]: non-readable binaries - binfmt_misc 2.6.0-test4".  This new
version addresses the concerns that were raised about that patch, except for
calling unshare_files() before allocating a new fd, because that feature has
not entered 2.6 yet.


Arun Sharma <arun.sharma@intel.com> says:

We were going through an internal review of this patch:

http://marc.theaimsgroup.com/?l=linux-kernel&m=107424598901720&w=2

which is in your tree already.  I'm not sure if this line of code got
sufficient review.

+               /* call prepare_binprm before switching to interpreter's file
+                * so that all security calculation will be done according to
+                * binary and not interpreter */
+               retval = prepare_binprm(bprm);

The case that concerns me is: unprivileged interpreter and a privileged
binary.  One can use binfmt_misc to execute untrusted code (interpreter) with
elevated privileges.  One could argue that all binfmt_misc interpreters are
trusted, because only root can register them.  But that's a change from the
traditional behavior of binfmt_misc (and binfmt_script).


(Update):

Arun pointed out that calculating the process credentials according to the
binary that needs to be translated is a bit risky, since it requires the
administrator to pay extra attention not to register an interpreter which is
not intended to run with root credentials.

After discussing this issue with him, I would like to propose a modified
patch: The old patch did 2 things - 1) open the binary for reading and 2)
calculate the credentials according to the binary.

I removed the riskier part of changing the credentials calculation, so the
revised patch only opens the binary for reading.  It also includes a few words
of warning in the description of the 'open-binary' feature in
binfmt_misc.txt, and makes the function entry_status print the flags in use.

As for the 'credentials' part of the patch, I will prepare a separate patch
for it and send it again to the LKML, describing the problem and asking for
people's comments.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

15 Jun, 2004 1 commit
- Fix i_size corruption in case of overlapped readdir changing cached file size... · d31ea435
  Steve French authored 20 years ago
```
Fix i_size corruption in case of overlapped readdir changing cached file size and local cached write extending file 
```
14 Jun, 2004 3 commits

fix fealist struct (xattr support part 3) · 50df063c
Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```

[PATCH] stat nlink resolution fix · 5b2785a1

Chris Wedgwood authored 20 years ago


Some filesystems can get overflows when their link count exceeds
65534.  This patch increases the kernel's internal resolution for this,
and adds a check to the old system call paths so that they return an
error (-EOVERFLOW) as required (as suggested by Al Viro).
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Fix nfs writepage behaviour · 5241cac5

Andrew Morton authored 20 years ago


From: Nick Piggin <nickpiggin@yahoo.com.au>

nfs_writepage() refuses to write back mapped pages at all on the page
reclaim path, causing systems to get locked up when there's a lot of dirty
mmapped data around.  The patch changes NFS so that it will start I/O
against these pages.

The code as it stands is designed to defer writeout to pdflush, which can do
larger, more efficient I/Os.  But there shouldn't be much traffic by this
path, and going slow is better than not going at all.

Patch originally from Trond.
Signed-off-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

13 Jun, 2004 3 commits
- fix listxattr error path · 05a46cc8
  Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```
- do not filemap_fdatawrite when reconnecting in write to avoid potential deadlock · c1f8b629
  Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```
- lock session when reconnecting so we do not oops in retrying sendmsg · e70aaeb1
  Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```
12 Jun, 2004 4 commits

[PATCH] spoll_create size check · 0d39f73b

Andrew Morton authored 20 years ago


From: Davide Libenzi <davidel@xmailserver.org>

This is a sanity check on the size parameter.  Nothing explodes without it,
but converting a negative size to unsigned simply triggers a big allocation.
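A sketch of such a check follows; the exact condition used in the patch is not shown in this log. epoll_create(2) documents EINVAL for a size that is not positive, and the second helper shows what an unchecked negative value turns into. Both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Reject sizes that are not positive before the value reaches code that
 * treats it as unsigned. */
static int check_epoll_size(int size)
{
    if (size <= 0)
        return -EINVAL;
    return 0;
}

/* Without the check: a negative int reinterpreted as an unsigned size
 * becomes enormous (e.g. -1 becomes UINT_MAX). */
static unsigned int as_unsigned_size(int size)
{
    return (unsigned int)size;
}
```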
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] sync_inodes_sb() stack reduction · aa1df6ca

Andrew Morton authored 20 years ago


Reduce stack consumption in sync_inodes_sb() via read_page_state().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

remove compile warning · 6774af67
Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```
Extended Attributes part 1 · a44280c6
Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```

11 Jun, 2004 3 commits
- Add missing EA info levels · e7bba9c2
  Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```
- Handle rename of hardlinked files properly (treat as a noop) · 62bbc304
  Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```
- flush write behind cached data, for files reopened after session reconnection after session drop · e3e741aa
  Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```
10 Jun, 2004 2 commits

Fix sparse tool compile warnings for cifs · 7d83dc5e
Steve French authored 20 years ago
```
Signed-off-by: Steve French (sfrench@us.ibm.com)
```

NTFS: 2.1.14 - Fix an NFSd caused deadlock reported by several users. · 290a768a

Anton Altaparmakov authored 20 years ago


- Modify fs/ntfs/ntfs_readdir() to copy the index root attribute value
  to a buffer so that we can put the search context and unmap the mft
  record before calling the filldir() callback.  We need to do this
  because of NFSd, which calls ->lookup() from its filldir() callback
  and this causes NTFS to deadlock as ntfs_lookup() maps the mft record
  of the directory and since ntfs_readdir() has got it mapped already
  ntfs_lookup() deadlocks.
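The ordering change can be modelled in user space. In this sketch a "record" that may only be mapped once at a time stands in for the MFT record. A second map attempt returns an error where the real code would deadlock. All names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for an MFT record that can be mapped only once
 * at a time. */
struct record {
    int mapped;
    char data[64];
};

typedef int (*filldir_fn)(struct record *dir, const char *name);

static int map_record(struct record *r)
{
    if (r->mapped)
        return -1;              /* the real code would wait forever */
    r->mapped = 1;
    return 0;
}

static void unmap_record(struct record *r)
{
    r->mapped = 0;
}

/* Callback that, like NFSd's filldir, looks the directory up again. */
static int lookup_cb(struct record *dir, const char *name)
{
    (void)name;
    if (map_record(dir) != 0)
        return -1;
    unmap_record(dir);
    return 0;
}

/* Old ordering: the callback runs while the record is still mapped. */
static int readdir_naive(struct record *dir, filldir_fn cb)
{
    int ret;

    if (map_record(dir) != 0)
        return -1;
    ret = cb(dir, dir->data);
    unmap_record(dir);
    return ret;
}

/* New ordering: copy what is needed, unmap, then call back. */
static int readdir_copy_first(struct record *dir, filldir_fn cb)
{
    char buf[sizeof dir->data];

    assert(dir != NULL);
    if (map_record(dir) != 0)
        return -1;
    memcpy(buf, dir->data, sizeof buf);
    unmap_record(dir);
    return cb(dir, buf);
}
```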
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>

09 Jun, 2004 5 commits

JFS: Better RAS when btstack is overrun · fa2c79c3

Dave Kleikamp authored 20 years ago


The current warning and/or trap when the btstack is overrun in
dtSearch or dtReadFirst are not very helpful.  Add code to detect
the stack overrun earlier, print something useful, and return
gracefully.

I've found that dbFree being called with blkno == 0 can lead to this
error, so I put in a specific check for that.
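Detecting the overrun early is essentially a bounds check on every push. The sketch below uses a hypothetical fixed-depth stack; JFS's real btstack structure and error reporting differ.

```c
#include <assert.h>
#include <errno.h>

#define BTSTACK_DEPTH 8   /* hypothetical maximum tree depth */

struct btframe {
    long bn;     /* block number of the visited node */
    int index;   /* slot taken within that node */
};

struct btstack {
    struct btframe stack[BTSTACK_DEPTH];
    int top;
};

/* Push a frame, detecting the overrun before writing past the array so
 * the caller can report it and fail gracefully instead of trapping. */
static int btstack_push(struct btstack *bs, long bn, int index)
{
    assert(bs->top >= 0);
    if (bs->top >= BTSTACK_DEPTH)
        return -EIO;
    bs->stack[bs->top].bn = bn;
    bs->stack[bs->top].index = index;
    bs->top++;
    return 0;
}
```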
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>

[PATCH] aio.c sparse warning fix · 004a2b36

Andrew Morton authored 20 years ago


Randy Dunlap <rddunlap@osdl.org> points out that sparse warns about the test
of an undefined preprocessor identifier.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ext3: journal_flush() needs journal_lock_updates() · b7d41b55

Andrew Morton authored 20 years ago


We need to take journal_lock_updates() while remounting r/o to prevent a new
transaction starting while journal_flush() is running.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] writeback_inodes can race with unmount · 7052fc2b

Andrew Morton authored 20 years ago


From: Chris Mason <mason@suse.com>

There's a small window where the filesystem can be unmounted during
writeback_inodes.  The end result is the iput done by sync_sb_inodes could
be done after the FS put_super and the super has been removed from all
lists.

The fix is to hold the s_umount sem during sync_sb_inodes to make sure
the FS doesn't get unmounted.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] jbd: descriptor buffer state fix · ace476bb

Andrew Morton authored 20 years ago

Fix a problem discovered by Jeff Mahoney <jeffm@suse.com>, based on an initial
patch from Chris Mason <mason@suse.com>.

journal_get_descriptor_buffer() is used to obtain a regular old buffer_head
against the blockdev mapping. The caller will populate that bh by hand and
will then submit it for writing.

But there are problems:

a) The function sets bh->b_state nonatomically. But this buffer is
accessible to other CPUs via pagecache lookup.

b) The function sets the buffer dirty and then the caller populates it and
then it is submitted for I/O. Wrong order: there's a window in which the
VM could write the buffer before it is fully populated.

c) The function fails to set the buffer uptodate after zeroing it. And one
caller forgot to mark it uptodate as well. So if the VM happens to decide
to write the containing page back, __block_write_full_page() encounters a
dirty, not uptodate buffer, which is an illegal state. This was generating
buffer_error() warnings before we removed buffer_error().

Leaving the buffer not uptodate also means that a concurrent reader of
/dev/hda1 could cause physical I/O against the buffer, scribbling on what
we just put in it.

So journal_get_descriptor_buffer() is changed to mark the buffer
uptodate, under the buffer lock.
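The two rules above can be modelled in user space with C11 atomics in place of the kernel's bit helpers. First, state bits are set with an atomic read-modify-write rather than a plain assignment. Second, the buffer is marked uptodate after zeroing and marked dirty only once it is fully populated. The struct and bit numbers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical bit numbers for the buffer state word. */
enum { BH_UPTODATE = 0, BH_DIRTY = 1 };

struct bh_sketch {
    atomic_ulong state;
    char data[512];
};

/* Atomic read-modify-write: bits set concurrently by other CPUs survive,
 * unlike a plain "state = ..." assignment. */
static void set_state_bit(struct bh_sketch *bh, int bit)
{
    atomic_fetch_or(&bh->state, 1UL << bit);
}

static int test_state_bit(struct bh_sketch *bh, int bit)
{
    return (int)((atomic_load(&bh->state) >> bit) & 1UL);
}

/* Zero the block and mark it uptodate, populate it, and only then mark it
 * dirty, so writeback never sees a dirty buffer that is half filled. */
static void prepare_descriptor(struct bh_sketch *bh, const char *payload)
{
    assert(strlen(payload) < sizeof bh->data);
    memset(bh->data, 0, sizeof bh->data);
    set_state_bit(bh, BH_UPTODATE);
    memcpy(bh->data, payload, strlen(payload));
    set_state_bit(bh, BH_DIRTY);
}
```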

I considered changing journal_get_descriptor_buffer() to return a locked
buffer but there doesn't seem to be a need for this, and both callers end up
using ll_rw_block() anyway, which requires that the buffer be unlocked again.

Note that the journal_get_descriptor_buffer() callers dirty these buffers with
set_buffer_dirty(). That's a bit naughty, because it could create dirty
buffers against a clean page - an illegal state. They really should use
mark_buffer_dirty() to dirty the page and inode as well. But all callers will
immediately write and clean the buffer anyway, so we can safely leave this
optimising cheat in place.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

08 Jun, 2004 1 commit

NTFS: 2.1.13 - Enable overwriting of resident files and housekeeping of system files. · 32e5fcaa

Anton Altaparmakov authored 20 years ago

- Mark the volume dirty when (re)mounting read-write and mark it clean
  when unmounting or remounting read-only.  If any volume errors are
  found, the volume is left marked dirty to force chkdsk to run.
- Add code to set the NT4 compatibility flag when (re)mounting
  read-write for newer NTFS versions but leave it commented out for now
  since we do not make any modifications that are NTFS 1.2 specific yet
  and since setting this flag breaks Captive-NTFS which is not nice.
  This code must be enabled once we start writing NTFS 1.2 specific
  changes otherwise Windows NTFS driver might crash / cause corruption.
- Fix a silly bug that caused a deadlock in ntfs_mft_writepage().
  For inode 0, i.e. $MFT itself, we cannot use ilookup5() from
  there because the inode is already locked by the kernel
  (fs/fs-writeback.c::__sync_single_inode()) and ilookup5() waits
  until the inode is unlocked before returning it and it never gets
  unlocked because ntfs_mft_writepage() never returns.  )-:
  Fortunately, we have inode 0 pinned in icache for the duration
  of the mount so we can access it directly.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>

07 Jun, 2004 6 commits
- handle partial page update of page in cache that is not uptodate better for... · d8c2ddb6
  Steve French authored 20 years ago
```
handle partial page update of page in cache that is not uptodate better for the situation in which file is open writeonly
Signed-off-by: Steve French <sfrench@us.ibm.com>
```
- Make stats display more consistent - under /proc/fs/cifs/Stats · e22fe382
  Steve French authored 20 years ago
```
Signed-off-by: Steve French <sfrench@us.ibm.com>
```
- NTFS: Add functions ntfs_{clear,set}_volume_flags(), to modify the volume · 7510c432
  Anton Altaparmakov authored 20 years ago
```
      information flags (fs/ntfs/super.c).
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
```
- fix up whitespace · 665ef081
  Steve French authored 20 years ago
```
Signed-off-by: Steve French <sfrench@us.ibm.com>
```
- Add 2 missing kmalloc failure checks during cifs mount time · dbc2f102
  Steve French authored 20 years ago
```
Signed-off-by: Yury Umanets <torque@ukrpost.net>
Signed-off-by: Steve French <sfrench@us.ibm.com>
```
- Fix race in updating tcpStatus field · f2ba8b3c
  Steve French authored 20 years ago
05 Jun, 2004 8 commits

[PATCH] nfs-direct warning fix · 8498dc03

Andrew Morton authored 20 years ago


fs/nfs/direct.c: In function `nfs_file_direct_write':
fs/nfs/direct.c:549: warning: initialization discards qualifiers from pointer target type
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] fix sysfs node cpumap for large NR_CPUS · b431aa18

Rusty Russell authored 20 years ago


As pointed out by Paul Jackson <pj@sgi.com>, sometimes 99 chars is not enough.
We currently get a page from sysfs: that code should check we haven't overrun
it.
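The fix amounts to formatting into the whole page that sysfs provides while checking for overrun. In the sketch below, cap stands in for that page size. The helper name and the comma-separated 32-bit-word layout are assumptions for illustration; the kernel has its own bitmap formatting helpers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Format nwords 32-bit mask words, most significant first, as
 * "xxxxxxxx,...,xxxxxxxx\n" into buf of capacity cap. Returns the length
 * written, or -1 if the text would not fit, instead of overrunning buf. */
static int format_mask_words(char *buf, size_t cap, const uint32_t *words,
                             size_t nwords)
{
    size_t len = 0;

    assert(buf != NULL && cap > 0);
    buf[0] = '\0';
    for (size_t i = nwords; i-- > 0; ) {
        int n = snprintf(buf + len, cap - len,
                         i ? "%08" PRIx32 "," : "%08" PRIx32 "\n", words[i]);
        if (n < 0 || (size_t)n >= cap - len)
            return -1;      /* would truncate: report it */
        len += (size_t)n;
    }
    return (int)len;
}
```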
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>