Commits · 179b68bbaadd8296c537ac11f0f5e825c188bfa8 · Kirill Smelkov / linux

05 Jul, 2003 10 commits

[PATCH] Use kblockd for running request queues · 179b68bb

Andrew Morton authored Jul 04, 2003

Using keventd for running request_fns is risky because keventd itself can
block on disk I/O.  Use the new kblockd kernel threads for the generic
unplugging.

179b68bb

[PATCH] anticipatory I/O scheduler · 97ff29c2

Andrew Morton authored Jul 04, 2003

From: Nick Piggin <piggin@cyberone.com.au>

This is the core anticipatory IO scheduler.  There are nearly 100 changesets
in this and five months' work.  I really cannot describe it fully here.

Major points:

- It works by recognising that reads are dependent: we don't know where the
  next read will occur, but it's probably close-by the previous one.  So once
  a read has completed we leave the disk idle, anticipating that a request
  for a nearby read will come in.

- There is read batching and write batching logic.

  - when we're servicing a batch of writes we will refuse to seek away
    for a read for some tens of milliseconds.  Then the write stream is
    preempted.

  - when we're servicing a batch of reads (via anticipation) we'll do
    that for some tens of milliseconds, then preempt.

- There are request deadlines, for latency and fairness.
  The oldest outstanding request is examined at regular intervals. If
  this request is older than a specific deadline, it will be the next
  one dispatched. This gives a good fairness heuristic while being simple
  because processes tend to have localised IO.
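
The deadline check in the last point can be modelled in a few lines.  This
is only a sketch: the class, the method names and the fallback of picking the
request nearest the head are assumptions made for illustration, not the
scheduler's actual code.

```python
from collections import deque


class DeadlineQueue:
    """Toy model: each request is a (sector, submit_time_ms) tuple."""

    def __init__(self, expire_ms):
        self.expire_ms = expire_ms   # plays the role of a per-request deadline
        self.fifo = deque()          # oldest outstanding request at the left

    def add(self, sector, now_ms):
        self.fifo.append((sector, now_ms))

    def next_request(self, head_sector, now_ms):
        if not self.fifo:
            return None
        oldest = self.fifo[0]
        if now_ms - oldest[1] > self.expire_ms:
            # The oldest request has passed its deadline: dispatch it next.
            self.fifo.popleft()
            return oldest
        # Assumed fallback policy: favour locality, nearest sector to the head.
        best = min(self.fifo, key=lambda r: abs(r[0] - head_sector))
        self.fifo.remove(best)
        return best
```

In this model a far-away request is passed over in favour of nearby ones
only until it has waited longer than expire_ms, at which point it jumps the
queue.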


Just about all of the rest of the complexity involves an array of fixups
which prevent most of the obvious failure modes with anticipation: trying to
not leave the disk head pointlessly idle.  Some of these algorithms are:

- Process tracking.  If the process whose read we are anticipating submits
  a write, abandon anticipation.

- Process exit tracking.  If the process whose read we are anticipating
  exits, abandon anticipation.

- Process IO history.  We accumulate statistical info on the process's
  recent IO patterns to aid in making decisions about how long to anticipate
  new reads.

  Currently thinktime and seek distance are tracked. Thinktime is the
  time between when a process's last request has completed and when it
  submits another one. Seek distance is simply the number of sectors
  between each read request. If either statistic becomes too high, then
  it isn't anticipated that the process will submit another read.
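
A minimal sketch of how thinktime and seek distance might be tracked per
process.  The decaying average, the threshold values and all names here are
assumptions chosen for illustration; the patch itself only says that
statistical info is accumulated and compared against "too high".

```python
class IOHistory:
    """Toy per-process record of thinktime and seek distance."""

    def __init__(self, max_thinktime_ms=6.0, max_seek_sectors=8192, weight=0.25):
        # All three defaults are illustrative assumptions, not kernel values.
        self.max_thinktime_ms = max_thinktime_ms
        self.max_seek_sectors = max_seek_sectors
        self.weight = weight
        self.mean_thinktime_ms = 0.0
        self.mean_seek_sectors = 0.0
        self.last_completion_ms = None
        self.last_sector = None

    def request_completed(self, now_ms):
        self.last_completion_ms = now_ms

    def read_submitted(self, sector, now_ms):
        w = self.weight
        if self.last_completion_ms is not None:
            # Thinktime: gap between the last completion and this submission.
            thinktime = now_ms - self.last_completion_ms
            self.mean_thinktime_ms += w * (thinktime - self.mean_thinktime_ms)
        if self.last_sector is not None:
            # Seek distance: sectors between consecutive read requests.
            seek = abs(sector - self.last_sector)
            self.mean_seek_sectors += w * (seek - self.mean_seek_sectors)
        self.last_sector = sector

    def should_anticipate(self):
        return (self.mean_thinktime_ms <= self.max_thinktime_ms
                and self.mean_seek_sectors <= self.max_seek_sectors)
```

With these assumed thresholds, a process issuing closely spaced sequential
reads keeps both averages low and stays worth waiting for, while one that
seeks widely or pauses for long periods between reads does not.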

The above all means that we need a per-process "io context".  This is a fully
refcounted structure.  In this patch it is AS-only.  Later we generalise it a
little so other IO schedulers could use the same framework.

- Requests are grouped as synchronous and asynchronous, whereas the deadline
  scheduler groups requests as reads and writes. This can provide better
  sync write performance, and may give better responsiveness with journalling
  filesystems (although we haven't done that yet).

  We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
  current->flags.  The plan is to remove this later, and to propagate the
  sync hint from writeback_control.sync_mode into bio->bi_flags thence into
  request->flags.  Once that is done, direct-io needs to set the BIO sync
  hint as well.

- Quite a bit of complexity has also gone into bashing TCQ into
  submission. Timing for a read batch is not started until the first read
  request actually completes. A read batch also does not start until all
  outstanding writes have completed.

AS is the default IO scheduler.  deadline may be chosen by booting with
"elevator=deadline".

There are a few reasons for retaining deadline:

- AS is often slower than deadline in random IO loads with large TCQ
  windows. The usual real world task here is OLTP database loads.

- deadline is presumably more stable.

- deadline is much simpler.



The tunable per-queue entries under /sys/block/*/iosched/ are all in
milliseconds:

* read_expire

  Controls how long until a request becomes "expired".

  It also controls the interval at which expired requests are served,
  so with it set to 50, a request might take anywhere up to 100ms to be
  serviced _if_ it is the next on the expired list.

  Obviously it can't make the disk go faster.  The result is basically the
  timeslice a reader gets in the presence of other IO.  100/((seek time /
  read_expire) + 1) is very roughly the % streaming read efficiency your disk
  should get in the presence of multiple readers.

* read_batch_expire

  Controls how much time a batch of reads is given before pending writes
  are served.  Higher value is more efficient.  Shouldn't really be below
  read_expire.

* write_ versions of the above

* antic_expire

  Controls the maximum amount of time we can anticipate a good read before
  giving up.  Many other factors may cause anticipation to be stopped early,
  or some processes will not be "anticipated" at all.  Should be a bit higher
  for big seek time devices though not a linear correspondence - most
  processes have only a few ms thinktime.
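
The rough efficiency estimate given under read_expire can be worked through
numerically.  This sketch treats efficiency as read_expire / (read_expire +
seek time), i.e. it assumes each reader's timeslice costs one seek; the
function name and the sample numbers are illustrative only.

```python
def streaming_read_efficiency(seek_time_ms, read_expire_ms):
    """Very rough % of disk time spent streaming with multiple readers.

    Assumes every timeslice of read_expire_ms is preceded by one seek of
    seek_time_ms.
    """
    return 100.0 / ((seek_time_ms / read_expire_ms) + 1.0)


# Example: a 10 ms seek with a 40 ms read_expire.
example = streaming_read_efficiency(10.0, 40.0)
```

Under that assumption a 10 ms seek with a 40 ms read_expire gives about 80%,
and raising read_expire pushes the figure towards 100% at the cost of
latency for the other readers.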

97ff29c2

[PATCH] elevator completion API · 104e6fdc

Andrew Morton authored Jul 04, 2003

From: Nick Piggin <piggin@cyberone.com.au>

Introduces an elevator_completed_req() callback with which the generic
queueing layer may tell an IO scheduler that a particular request has
finished.

104e6fdc

[PATCH] elv_may_queue() API function · 7d2483a9

Andrew Morton authored Jul 04, 2003

Introduces the elv_may_queue() predicate with which the IO scheduler may tell
the generic request layer that we may add another request to this queue.

It is used by the CFQ elevator.

7d2483a9

[PATCH] Create `kblockd' workqueue · 33c66485

Andrew Morton authored Jul 04, 2003

keventd is inappropriate for running block request queues because keventd
itself can get blocked on disk I/O, via call_usermodehelper()'s vfork and,
presumably, GFP_KERNEL allocations.

So create a new gang of kernel threads whose mandate is for running low-level
disk operations.  It must never block on disk IO, so any memory allocations
should be GFP_NOIO.

We mainly use it for running unplug operations from interrupt context.

33c66485

[PATCH] bring back the batch_requests function · 3abbd8ff

Andrew Morton authored Jul 04, 2003

From: Nick Piggin <piggin@cyberone.com.au>

The batch_requests function got lost during the merge of the dynamic request
allocation patch.

We need it for the anticipatory scheduler - when the number of threads
exceeds the number of requests, the anticipated-upon task will undesirably
sleep in get_request_wait().

And apparently some block devices which use small requests need it so they
string a decent number together.

Jens has acked this patch.

3abbd8ff

[PATCH] ipc semaphore optimization · 3faa61fe

Andrew Morton authored Jul 04, 2003

From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>

This patch proposes a performance fix for the current IPC semaphore
implementation.

There are two shortcomings in the current implementation.  First,
try_atomic_semop() is called twice to wake up a blocked process:
once from update_queue() (executed by the process that wakes up
the sleeping process) and once in the retry part of the blocked process
(executed by the blocked process that gets woken up).

A second issue is that when several sleeping processes are eligible
for wake-up, they wake up in daisy-chain formation, each one in turn
waking the next process in line.  However, every time a process
wakes up, it starts scanning the wait queue from the beginning, not from
where it was last scanned.  This causes a large amount of unnecessary
scanning of the wait queue when the wait queue is deep.
Blocked processes come and go, but chances are there are still quite a
few blocked processes sitting at the beginning of that queue.

What we are proposing here is to merge the portion of the code in the
bottom part of sys_semtimedop() (code that gets executed when a sleeping
process gets woken up) into the update_queue() function.  The benefit is
twofold: (1) to reduce redundant calls to try_atomic_semop() and (2) to
increase the efficiency of finding eligible processes to wake up and to
allow higher concurrency for multiple wake-ups.
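
The effect of the merge can be illustrated with a toy model that counts
attempts.  Everything here is a simplification made for illustration: the
class and function names are hypothetical, every sleeper only wants to take
units from a single counter, and `attempts` merely stands in for calls to
try_atomic_semop().

```python
from dataclasses import dataclass


@dataclass
class Sleeper:
    name: str
    need: int            # units the blocked process wants to take


class ToySemaphore:
    def __init__(self, value, sleepers):
        self.value = value
        self.queue = list(sleepers)   # wait queue, head first
        self.attempts = 0             # stands in for try_atomic_semop() calls
        self.woken = []

    def try_op(self, sleeper, commit):
        """Attempt the sleeper's operation; leave the value alone unless commit."""
        self.attempts += 1
        if self.value < sleeper.need:
            return False
        if commit:
            self.value -= sleeper.need
        return True


def wake_separately(sem):
    """Old flow: the waker only tests; the sleeper retries and rescans."""
    progressed = True
    while progressed:
        progressed = False
        for s in sem.queue:                    # every scan restarts at the head
            if sem.try_op(s, commit=False):    # attempt 1, made by the waker
                sem.try_op(s, commit=True)     # attempt 2, made by the sleeper
                sem.queue.remove(s)
                sem.woken.append(s.name)
                progressed = True
                break                          # the sleeper begins a new scan


def wake_merged(sem):
    """Merged flow: one scan completes each eligible operation directly."""
    for s in list(sem.queue):
        if sem.try_op(s, commit=True):
            sem.queue.remove(s)
            sem.woken.append(s.name)


def make_queue():
    # Two sleepers that cannot proceed sit at the head of the queue.
    return [Sleeper("A", 10), Sleeper("B", 10),
            Sleeper("C", 1), Sleeper("D", 1), Sleeper("E", 1)]
```

Starting from a value of 3 with this queue, both flows wake C, D and E, but
the old flow makes 14 attempts against 5 for the merged one, because it
tests each eligible sleeper twice and walks past A and B on every rescan.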

We have measured that this patch improves throughput for a large
application significantly on an industry-standard benchmark.

This patch is relative to 2.5.72.  Any feedback is very much
appreciated.

Some kernel profile data attached:

  Kernel profile before optimization:
  -----------------------------------------------
                0.05    0.14   40805/529060      sys_semop [133]
                0.55    1.73  488255/529060      ia64_ret_from_syscall [2]
[52]     2.5    0.59    1.88  529060         sys_semtimedop [52]
                0.05    0.83  477766/817966      schedule_timeout [62]
                0.34    0.46  529064/989340      update_queue [61]
                0.14    0.00 1006740/6473086     try_atomic_semop [75]
                0.06    0.00  529060/989336      ipcperms [149]
  -----------------------------------------------

                0.30    0.40  460276/989340      semctl_main [68]
                0.34    0.46  529064/989340      sys_semtimedop [52]
[61]     1.5    0.64    0.87  989340         update_queue [61]
                0.75    0.00 5466346/6473086     try_atomic_semop [75]
                0.01    0.11  477676/576698      wake_up_process [146]
  -----------------------------------------------
                0.14    0.00 1006740/6473086     sys_semtimedop [52]
                0.75    0.00 5466346/6473086     update_queue [61]
[75]     0.9    0.89    0.00 6473086         try_atomic_semop [75]
  -----------------------------------------------

  Kernel profile with optimization:

  -----------------------------------------------
                0.03    0.05   26139/503178      sys_semop [155]
                0.46    0.92  477039/503178      ia64_ret_from_syscall [2]
[61]     1.2    0.48    0.97  503178         sys_semtimedop [61]
                0.04    0.79  470724/784394      schedule_timeout [62]
                0.05    0.00  503178/3301773     try_atomic_semop [109]
                0.05    0.00  503178/930934      ipcperms [149]
                0.00    0.03   32454/460210      update_queue [99]
  -----------------------------------------------
                0.00    0.03   32454/460210      sys_semtimedop [61]
                0.06    0.36  427756/460210      semctl_main [75]
[99]     0.4    0.06    0.39  460210         update_queue [99]
                0.30    0.00 2798595/3301773     try_atomic_semop [109]
                0.00    0.09  470630/614097      wake_up_process [146]
  -----------------------------------------------
                0.05    0.00  503178/3301773     sys_semtimedop [61]
                0.30    0.00 2798595/3301773     update_queue [99]
[109]    0.3    0.35    0.00 3301773         try_atomic_semop [109]
  -----------------------------------------------

The number of calls to both try_atomic_semop() and update_queue() is
reduced by roughly 50% as a result of the merge.  Execution time of
sys_semtimedop is reduced because of the reduction in the low level
functions.
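
The reduction can be checked directly against the call counts in the two
profiles.  The variable and function names below are just for this
calculation; the numbers are the totals from the primary lines for
try_atomic_semop and update_queue.

```python
# Total call counts taken from the two kernel profiles quoted above.
before = {"try_atomic_semop": 6473086, "update_queue": 989340}
after = {"try_atomic_semop": 3301773, "update_queue": 460210}


def reduction_percent(old, new):
    """Percentage drop in call count going from old to new."""
    return 100.0 * (old - new) / old


reductions = {name: reduction_percent(before[name], after[name])
              for name in before}
```

This works out to a drop of roughly 49% for try_atomic_semop and roughly 53%
for update_queue.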

3faa61fe

[PATCH] PCI domain scanning fix · d8d90b60

Andrew Morton authored Jul 04, 2003

From: Matthew Wilcox <willy@debian.org>

ppc64 oopses on boot because pci_scan_bus_parented() is unexpectedly
returning NULL.  Change pci_scan_bus_parented() to correctly handle
overlapping PCI bus numbers on different domains.

d8d90b60

Merge bk://ppc.bkbits.net/for-linus-ppc · b40585d0
Linus Torvalds authored Jul 04, 2003
```
into home.osdl.org:/home/torvalds/v2.5/linux
```
b40585d0
Merge samba.org:/home/paulus/kernel/linux-2.5 · b727fa42
Paul Mackerras authored Jul 06, 2003
```
into samba.org:/home/paulus/kernel/for-linus-ppc
```
b727fa42

04 Jul, 2003 15 commits

[PATCH] wrong pid in siginfo_t · c7aa953c

Ulrich Drepper authored Jul 04, 2003

If a signal is sent via kill() or tkill() the kernel fills in the wrong
PID value in the siginfo_t structure (obviously only if the handler has
SA_SIGINFO set).

POSIX specifies that the si_pid field is filled with the process ID, and
in Linux parlance that's the "thread group" ID, not the thread ID.
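
The expected behaviour can be observed from user space.  A minimal sketch,
assuming Linux and Python 3.8 or later; the helper name is made up for this
example.  It sends a thread-directed signal from a non-leader thread and
reads back the si_pid the receiver sees.

```python
import os
import signal
import threading


def observe_sender_pid(sig=signal.SIGUSR1, timeout_s=5):
    """Return (si_pid seen by the receiver, kernel thread ID of the sender)."""
    receiver = threading.get_ident()
    # Block the signal so it stays pending until it is collected below.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    sender_tid = []

    def sender():
        # Kernel thread ID of this non-leader thread (differs from the PID).
        sender_tid.append(threading.get_native_id())
        # Thread-directed signal aimed at the receiving thread.
        signal.pthread_kill(receiver, sig)

    try:
        worker = threading.Thread(target=sender)
        worker.start()
        worker.join()
        # Collect the pending signal synchronously; None on timeout.
        info = signal.sigtimedwait({sig}, timeout_s)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return (info.si_pid if info is not None else None), sender_tid[0]
```

On a kernel with this fix, the first value should equal os.getpid() (the
thread group ID) even though the sending thread's own ID is different.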

c7aa953c

When forcing through a signal for some thread-synchronous · 9e008c3c

Linus Torvalds authored Jul 04, 2003

event (ie SIGSEGV, SIGFPE etc that happens as a result of a
trap as opposed to an external event), if the signal is
blocked we will not invoke a signal handler, we will just
kill the thread with the signal.

This is equivalent to what we do in the SIG_IGN case: you
cannot ignore or block synchronous signals, and if you try,
we'll just have to kill you.

We don't want to handle endless recursive faults, which the
old behaviour easily led to if the stack was bad, for example.

9e008c3c

Go back to defaulting to 6-byte commands for MODE SENSE, · b79c8524

Linus Torvalds authored Jul 04, 2003

since some drivers seem to be unhappy about the 10-byte
version. 

The subsystem configuration can override this (eg USB or
ide-scsi).

b79c8524

[PATCH] EISA: avoid unnecessary probing · c4404d65

Marc Zyngier authored Jul 04, 2003

- By default, do not try to probe the bus if the mainboard does not
  seem to support EISA (allow this behaviour to be changed through a
  command-line option).

c4404d65

[PATCH] EISA: PCI-EISA dma_mask · e34121f7
Marc Zyngier authored Jul 04, 2003
```
- Use parent bridge device dma_mask as default for each discovered
  device.
```
e34121f7
[PATCH] EISA: PA-RISC changes · e0e5907e
Marc Zyngier authored Jul 04, 2003
```
- Probe the right number of EISA slots on PA-RISC. No more, no less.
```
e0e5907e
[PATCH] EISA: More EISA ids · 5fe1dbf4
Marc Zyngier authored Jul 04, 2003

5fe1dbf4
[PATCH] EISA: Documentation update · d8d9c9e8
Marc Zyngier authored Jul 04, 2003

d8d9c9e8

[PATCH] EISA: core changes · ddb6ee51

Marc Zyngier authored Jul 04, 2003

- Now reserves I/O ranges according to EISA specs (four 256-byte
  regions instead of a single 4KB region).

- By default, do not try to probe the bus if the mainboard does not
  seem to support EISA (allow this behaviour to be changed through a
  command-line option).

- Use parent bridge device dma_mask as default for each discovered
  device.

- Allow devices to be enabled or disabled from the kernel command line
  (useful for non-x86 platforms where the firmware simply disables
  devices it doesn't know about...).
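
The I/O range change in the first point can be pictured as follows.  This is
a sketch under stated assumptions, not the driver code: it assumes slot N
owns the 4KB block starting at N * 0x1000 and that the four 256-byte windows
repeat every 0x400 bytes inside that block.  The function name is
hypothetical.

```python
def eisa_slot_io_ranges(slot, slot_size=0x1000, stride=0x400, length=0x100):
    """Return four (start, end) I/O port ranges for one EISA slot.

    Assumption: slot N owns the 4KB block at N * slot_size, and the
    256-byte windows sit at multiples of stride within it.
    """
    base = slot * slot_size
    return [(base + i * stride, base + i * stride + length - 1)
            for i in range(4)]
```

Under those assumptions slot 1 would reserve 0x1000-0x10ff, 0x1400-0x14ff,
0x1800-0x18ff and 0x1c00-0x1cff, leaving the rest of its 4KB block
unclaimed.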

ddb6ee51

Merge bk://kernel.bkbits.net/jgarzik/irda-2.5 · ee389f0a
Linus Torvalds authored Jul 04, 2003
```
into home.osdl.org:/home/torvalds/v2.5/linux
```
ee389f0a

Carl-Daniel Hailfinger suggested adding a paranoid incoming · 87d890b8

Linus Torvalds authored Jul 04, 2003

trigger as per the "bk help triggers" suggestion, so that
we'll see any new triggers showing up in the tree.

Make it so.

87d890b8

[PATCH] Use the intents in 'nameidata' to improve NFS close-to-open consistency · 52d1430d

Trond Myklebust authored Jul 03, 2003

  - Make use of the open intents to improve close-to-open
    cache consistency. Only force data cache revalidation when
    we're doing an open().

  - Add true exclusive create to NFSv3.

  - Optimize away the redundant ->lookup() to check for an
    existing file when we know that we're doing NFSv3 exclusive
    create.

  - Optimize away all ->permission() checks other than those for
    path traversal, open(), and sys_access().

52d1430d

[PATCH] Pass 'nameidata' to ->permission() · a574f324

Trond Myklebust authored Jul 03, 2003

   - Make the VFS pass the struct nameidata as an optional parameter
     to the permission() inode operation.

   - Patch may_create()/may_open() so it passes the struct nameidata from
     vfs_create()/open_namei() as an argument to permission().

   - Add an intent flag for the sys_access() function.

a574f324

[PATCH] Pass 'nameidata' to ->create() · 675b5da0

Trond Myklebust authored Jul 03, 2003

  - Make the VFS pass the struct nameidata as an optional argument
    to the create inode operation.
  - Patch vfs_create() to take a struct nameidata as an optional
    argument.

675b5da0

[PATCH] Add open intent information to the 'struct nameidata' · fc8b427e

Trond Myklebust authored Jul 03, 2003

 - Add open intent information to the 'struct nameidata'.
 - Pass the struct nameidata as an optional parameter to the
   lookup() inode operation.
 - Pass the struct nameidata as an optional parameter to the
   d_revalidate() dentry operation.
 - Make link_path_walk() set the LOOKUP_CONTINUE flag in nd->flags instead
   of passing it as an extra parameter to d_revalidate().
 - Make open_namei(), and sys_uselib() set the open()/create() intent
   data.

fc8b427e

03 Jul, 2003 15 commits
- Merge bk://stop.crashing.org/linux-2.5-misc · 7246a035
  Paul Mackerras authored Jul 04, 2003
```
into samba.org:/home/paulus/kernel/for-linus-ppc
```
  7246a035
- Merge bk://stop.crashing.org/linux-2.5-obsolete · 5bb4ee47
  Paul Mackerras authored Jul 04, 2003
```
into samba.org:/home/paulus/kernel/for-linus-ppc
```
  5bb4ee47
- Merge samba.org:/home/paulus/kernel/linux-2.5 · 3cf09fdd
  Paul Mackerras authored Jul 04, 2003
```
into samba.org:/home/paulus/kernel/for-linus-ppc
```
  3cf09fdd
- [PATCH] fix via irq routing · 81523bf2
  Jeff Garzik authored Jul 03, 2003
```
Via irq routing has a funky PIRQD location.  I checked my datasheets
and, yep, this is correct all the way back to via686a.
This bug existed for _ages_.  I wonder if I created it, even...
```
  81523bf2
- Merge bk://kernel.bkbits.net/gregkh/linux/pci-2.5 · e631aa44
  Linus Torvalds authored Jul 03, 2003
```
into home.osdl.org:/home/torvalds/v2.5/linux
```
  e631aa44
- Re-organize "ext3_get_inode_loc()" and make it easier to · 9c67eccb
  Linus Torvalds authored Jul 03, 2003
```
follow by splitting it into two functions: one that calculates
the position, and the other that actually reads the inode
block off the disk.
```
  9c67eccb
- Add an asynchronous buffer read-ahead facility. Nobody · 4b226454
  Linus Torvalds authored Jul 03, 2003
```
uses it for now, but I needed it for some tuning tests,
and it is potentially useful for others.
```
  4b226454
- Merge kroah.com:/home/linux/BK/bleed-2.5 · 8394c855
  Greg Kroah-Hartman authored Jul 03, 2003
```
into kroah.com:/home/linux/BK/pci-2.5
```
  8394c855
- driver core: add my copyright to class.c · 7e2fa992
  Greg Kroah-Hartman authored Jul 03, 2003
  
  7e2fa992
- [PATCH] driver core: added class_device_rename() · 59c6630a
  Greg Kroah-Hartman authored Jul 03, 2003
```
Based on a patch written by Dan Aloni <da-x@gmx.net>
```
  59c6630a
- [PATCH] kobject: add kobject_rename() · e956d3ab
  Greg Kroah-Hartman authored Jul 03, 2003
```
Based on a patch written by Dan Aloni <da-x@gmx.net>
```
  e956d3ab
- [PATCH] sysfs: add sysfs_rename_dir() · f91c01ac
  Greg Kroah-Hartman authored Jul 03, 2003
```
Based on a patch written by Dan Aloni <da-x@gmx.net>
```
  f91c01ac
- [PATCH] jiffies include fix · 04798180
  John Stultz authored Jul 03, 2003
```
This patch fixes a bad declaration of jiffies in timer_tsc.c and
timer_cyclone.c, replacing it with the proper usage of jiffies.h.
Caught by gregkh.
```
  04798180
- [PATCH] SYSFS: add module referencing to sysfs attribute files. · 1cf6d20f
  Greg Kroah-Hartman authored Jul 03, 2003
  
  1cf6d20f
- Merge bk://linux-pnp.bkbits.net/pnp-2.5 · d23caa21
  Linus Torvalds authored Jul 03, 2003
```
into home.osdl.org:/home/torvalds/v2.5/linux
```
  d23caa21