1. 09 Sep, 2002 28 commits
  2. 08 Sep, 2002 12 commits
    • [PATCH] handle_initrd() and request_module() · 09589177
      Alexander Viro authored
      There are 4 different scenarios of late boot:
      
      1.	no initrd or ROOT_DEV is ram0.  That's the simplest one - we want
      	whatever is on ROOT_DEV as final root.
      
      2.	initrd is there, ROOT_DEV is not ram0, /linuxrc on initrd doesn't
      	exit.   We want initrd mounted, /linuxrc launched and /linuxrc
      	will mount whatever it wants, maybe do pivot_root and exec init
      	itself.  The task with PID 1 (the parent of linuxrc) will sit there
      	reaping zombies, never leaving kernel mode.
      
      3.	initrd is there, ROOT_DEV is not ram0, /linuxrc on initrd does exit
      	and sets real-root-dev to 256 (1:0, a.k.a. ram0).  We want initrd
      	mounted, /linuxrc launched, and we expect linuxrc to mount all the
      	stuff we need, maybe do pivot_root and exit.  The parent of /linuxrc
      	(PID 1) will proceed to exec init once /linuxrc is done.
      
      4.	initrd is there, ROOT_DEV is not ram0, /linuxrc on initrd might have
      	done something or not, but when it exits real-root-dev is not ram0.
      	We want initrd mounted, /linuxrc launched and when it exits we are
      	going to mount final root according to real-root-dev.  If there is
      	/initrd on the final root, initrd will be moved there.  Otherwise
      	initrd will be unmounted and its memory (if possible) freed.  Then
      	we exec init from final root.
      
      Note that we want the parent of linuxrc chrooted to initrd while linuxrc
      runs - otherwise things like request_module() will be rather unhappy.  That
      goes for all variants that run linuxrc.
      
      Scenarios above go in order of increasing complexity.  Let's start with #4:
      
      	we had loaded initrd
      	we mount initrd on /root
      	we open / and /old (on rootfs)
      	chdir /root
      	mount --move . /
      	chroot .
      
      Now we have initrd mounted on /, we are chrooted into it but keep opened
      descriptors of / and /old, so we'll be able to break out of jail later.
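
      A minimal userspace-style sketch of the sequence above (the kernel does
      the equivalent with in-kernel syscall helpers in handle_initrd(); the
      helper name, device, filesystem type and missing error handling are
      illustrative assumptions):

      	#include <fcntl.h>
      	#include <sys/mount.h>
      	#include <unistd.h>

      	static int root_fd, old_fd;	/* rootfs descriptors kept to escape the jail */

      	static void enter_initrd_jail(void)
      	{
      		mount("/dev/ram0", "/root", "ext2", 0, NULL);	/* initrd on /root */
      		root_fd = open("/", O_RDONLY);			/* rootfs / */
      		old_fd  = open("/old", O_RDONLY);		/* rootfs /old */
      		chdir("/root");
      		mount(".", "/", NULL, MS_MOVE, NULL);		/* move initrd onto / */
      		chroot(".");					/* now jailed in the initrd */
      	}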
      
      	we fork a child that will be linuxrc
      	child closes opened descriptors, opens /dev/console, dups it to stdout
      and stderr, does setsid and execs /linuxrc.
      
      	parent sits there reaping zombies until child is finished.
      
      Note that both the parent and linuxrc are chrooted into the initrd, and if
      linuxrc calls pivot_root, the parent will also have its root/cwd switched.
      
      	OK, child is finished and after checking real_root_dev we see that
      it's not MKDEV(1,0).  Now we know that it's scenario #4.
      We break out of jail, doing the following:
      
      	fchdir to /old on rootfs
      	mount --move / .
      	fchdir to / on rootfs
      	chroot to .
      
      That will move initrd to /old and leave us with root and cwd in / of rootfs.
      We can close these two descriptors now - they've done their job.
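
      Continuing the sketch above, the escape uses the two saved rootfs
      descriptors (again a hedged illustration, not the kernel code itself):

      	static void leave_initrd_jail(void)
      	{
      		fchdir(old_fd);				/* cwd = /old on rootfs */
      		mount("/", ".", NULL, MS_MOVE, NULL);	/* move initrd onto /old */
      		fchdir(root_fd);			/* cwd = / on rootfs */
      		chroot(".");				/* root = rootfs again */
      		close(old_fd);
      		close(root_fd);
      	}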
      
      	We mount final root to /root
      	We attempt to mount --move /old /root/initrd; if we are successful -
      we chdir to /root, mount --move . / and chroot to .  That will leave us with
      	* final root on /
      	* initrd on /initrd of final root
      	* cwd and root on final root.
      
      At that point we simply exec init.
      
      	Now, if mount --move had failed, we have to clean up the mess.  We
      unmount (with MNT_DETACH) initrd from /old and do BLKFLSBUF on ram0.  After
      that we have final root on /root, initrd maybe still alive, but not mounted
      anywhere and our root/cwd in / of rootfs.  Again,
      
      	chdir /root
      	mount --move . /
      	chroot to .
      
      and we have final root mounted on /, we are chrooted into it and it's time
      for exec init.
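
      Putting the whole scenario #4 hand-off together in the same
      userspace-style sketch (device name, filesystem type, read-only flag and
      the helper name are assumptions; real-root-dev bookkeeping is omitted):

      	#include <sys/ioctl.h>
      	#include <linux/fs.h>				/* BLKFLSBUF */

      	static void hand_over_to_final_root(const char *root_dev, const char *fstype)
      	{
      		int fd;

      		mount(root_dev, "/root", fstype, MS_RDONLY, NULL);	/* final root */

      		if (mount("/old", "/root/initrd", NULL, MS_MOVE, NULL) != 0) {
      			/* no /initrd on the final root: detach initrd, free ram0 */
      			umount2("/old", MNT_DETACH);
      			fd = open("/dev/ram0", O_RDWR);
      			ioctl(fd, BLKFLSBUF, 0);
      			close(fd);
      		}

      		chdir("/root");
      		mount(".", "/", NULL, MS_MOVE, NULL);	/* final root becomes / */
      		chroot(".");
      		execl("/sbin/init", "init", (char *)NULL);
      	}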
      
      	That's it for scenario 4.  The rest will be simpler - there's less
      work to do.
      
      #3 diverges from #4 after linuxrc had finished and we had already broken out
      of jail.  Whatever we got from linuxrc is mounted on /old now, so we move it
      back to /, get chrooted there and exec init.   We could've left earlier
      (skipping the move to /old and move back parts), but that would lead to
      even messier logic in prepare_namespace() ;-/
      
      #2 means that the parent of /linuxrc never gets past waiting for its child
      to finish.
      End of story.
      
      #1 is the simplest variant - it mounts final root on /root and then does the
      usual "chdir there, mount --move . /, chroot to ." and execs init.
      
      Relevant code is in prepare_namespace()/handle_initrd() and yes, it's messy.
      Had been even worse... ;-/
    • [PATCH] M386 flush_one_tlb invlpg · 43b138da
      Hugh Dickins authored
      A CONFIG_M386 kernel running on a PPro+ processor with X86_FEATURE_PGE may
      set the _PAGE_GLOBAL bit; __flush_tlb_one must then use the invlpg
      instruction.  H. J. Lu reports (LKML, 8 Sept) that his P4 reboots due to
      this problem.
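
      A minimal sketch of the idea (not the actual patch): the PGE feature test
      is reduced here to a plain pge_in_use flag and the function name is
      illustrative.  Once _PAGE_GLOBAL is in use, reloading %cr3 no longer
      flushes the entry, so a single-page flush must use invlpg even on a
      CONFIG_M386 build.

      	static inline void flush_one_tlb_sketch(unsigned long addr, int pge_in_use)
      	{
      		if (pge_in_use)		/* global entries survive a %cr3 reload */
      			asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
      		else			/* non-global entries: reloading %cr3 is enough */
      			asm volatile("movl %%cr3, %%eax; movl %%eax, %%cr3"
      				     : : : "eax", "memory");
      	}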
    • [PATCH] Re: pinpointed: PANIC caused by dequeue_signal() in current Linus · 49ba178c
      Ingo Molnar authored
      This fixes the bootup crash.  There were two initialization bugs:
      
      	- INIT_SIGNAL needs to set shared_pending.
      
      	- exec() needs to set up newsig properly.
      
      The second one caused the crash Anton saw.
    • [PATCH] Use kmap_atomic() for generic_file_write() · 86ee4c5d
      Andrew Morton authored
      This patch uses the atomic copy_from_user() facility in
      generic_file_write().
      
      This required a change in the prepare_write/commit_write API
      definition.  It is no longer the case that these functions will kmap
      the page for you.
      
      If any part of the kernel wants to get at the page in the write path,
      it now has to kmap it for itself.  The best way to do this is with
      kmap_atomic(KM_USER0).
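
      For example, a hedged sketch of what a ->prepare_write()/->commit_write()
      implementation now has to do if it wants to touch page contents itself
      (the helper name is illustrative; the mapping calls follow the API
      described above):

      	#include <linux/highmem.h>
      	#include <linux/string.h>

      	static void zero_uncopied_range(struct page *page, unsigned from, unsigned to)
      	{
      		char *kaddr = kmap_atomic(page, KM_USER0);	/* no longer pre-kmapped */

      		memset(kaddr + from, 0, to - from);
      		flush_dcache_page(page);
      		kunmap_atomic(kaddr, KM_USER0);
      	}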
      
      This patch updates all callers.  It also converts several places which
      were unnecessarily using kmap() over to using kmap_atomic().
      
      The reiserfs changes here are Oleg Drokin's revised version.
      
      The patch has been tested with loop, ext2, ext3, reiserfs, jfs,
      minixfs, vfat, iso9660, nfs and the ramdisk driver.
      
      I haven't fixed the racy deadlock avoidance thing in
      generic_file_write() - the case where we take a fault when the source
      and dest of the copy are both the same pagecache page.
      
      There is a printk in there now which will trigger if the page was
      unexpectedly not present.  And guess what?  I get 50-100 of them when
      running `dbench 64' on mem=48m.   This deadlock can happen.
    • [PATCH] Use kmap_atomic() for generic_file_read() · 88a3b490
      Andrew Morton authored
      This patch allows the kernel to hold atomic kmaps in file_read_actor().
      
      We try to fault in the page, then take an atomic kmap.  If the atomic
      copy_to_user() then faults, drop a printk and fall back to kmap().
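
      A hedged sketch of that fallback pattern (not the exact file_read_actor()
      code: the function name is illustrative, the pre-fault is done with a
      throwaway __put_user() which is safe only because the copy overwrites
      that byte anyway, and the printk is omitted):

      	#include <linux/highmem.h>
      	#include <asm/uaccess.h>

      	static unsigned long copy_page_to_user_sketch(char *buf, struct page *page,
      						      unsigned long offset,
      						      unsigned long size)
      	{
      		unsigned long left;
      		char *kaddr;

      		__put_user(0, buf);			/* try to fault the user page in */

      		kaddr = kmap_atomic(page, KM_USER0);	/* atomic: the copy may not sleep */
      		left = __copy_to_user(buf, kaddr + offset, size);
      		kunmap_atomic(kaddr, KM_USER0);

      		if (left) {				/* the atomic copy faulted */
      			kaddr = kmap(page);		/* slow path: sleeping kmap */
      			left = __copy_to_user(buf, kaddr + offset, size);
      			kunmap(page);
      		}
      		return left;				/* bytes left uncopied */
      	}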
    • [PATCH] atomic copy_*_user infrastructure · 4b19c940
      Andrew Morton authored
      The patch implements the atomic copy_*_user() function.
      
      If the kernel takes a pagefault while running copy_*_user() in an
      atomic region, the copy_*_user() will fail (return a short value).
      
      And with this patch, holding an atomic kmap() puts the CPU into an
      atomic region.
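
      A hedged sketch of the resulting rule, with the relevant kernel state
      reduced to two flags (this is not the real do_page_fault()):

      	static int may_service_fault_by_sleeping(int preempt_count, int have_mm)
      	{
      		/* in_atomic() is true whenever preempt_count is raised, and
      		 * kmap_atomic() now raises it even without CONFIG_PREEMPT */
      		if (preempt_count != 0 || !have_mm)
      			return 0;	/* no_context: fixup, short copy_*_user() */
      		return 1;		/* normal path: fault the page in */
      	}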
      
      - Increment preempt_count() in kmap_atomic() regardless of the
        setting of CONFIG_PREEMPT.  The pagefault handler recognises this as
        an atomic region and refuses to service the fault.  copy_*_user will
        return a non-zero value.
      
      - Attempts to propagate the in_atomic() predicate to all the other
        highmem-capable architectures' pagefault handlers.  But the code is
        only tested on x86.
      
      - Fixed a PPC bug in kunmap_atomic(): it forgot to reenable
        preemption if HIGHMEM_DEBUG is turned on.
      
      - Fixed a sparc bug in kunmap_atomic(): it forgot to reenable
        preemption all the time, for non-fixmap pages.
      
      - Fix an error in <linux/highmem.h> - in the CONFIG_HIGHMEM=n case,
        kunmap_atomic() takes an address, not a page *.
    • [PATCH] refill the inactive list more quickly · 5f607d6e
      Andrew Morton authored
      Fix a problem noticed by Ed Tomlinson: under shifting workloads the
      shrink_zone() logic will refill the inactive list too slowly.
      
      Bale out of the zone scan when we've reclaimed enough pages.  Fixes a
      rarely-occurring problem wherein refill_inactive_zone() ends up
      shuffling 100,000 pages and generally goes silly.
      
      This needs to be revisited - we should go on and rebalance the lower
      zones even if we reclaimed enough pages from highmem.
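
      A hedged sketch of the bail-out, with toy types standing in for the real
      zone and scan-control structures:

      	struct zone_sketch {
      		int nr_reclaimable;	/* stand-in for a zone's reclaimable pages */
      	};

      	static int shrink_zones_sketch(struct zone_sketch *zone, int nr_zones, int target)
      	{
      		int reclaimed = 0;
      		int i;

      		for (i = 0; i < nr_zones && reclaimed < target; i++) {
      			int want = target - reclaimed;
      			int got = zone[i].nr_reclaimable < want ?
      				  zone[i].nr_reclaimable : want;

      			zone[i].nr_reclaimable -= got;
      			reclaimed += got;	/* loop exits once the target is met */
      		}
      		return reclaimed;
      	}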
    • [PATCH] Back out the initial work for atomic copy_*_user() · 9fdbd959
      Andrew Morton authored
      Back out the use of preempt_count to signify atomicity wrt pagefaults.
      We won't do it that way - in_atomic() works fine.
    • [PATCH] Fix the __block_write_full_page() error path. · 0e64a39d
      Andrew Morton authored
      Fix the ENOSPC recovery code in __block_write_full_page()
      
      - Don't write out clean buffers.
      
      - Set PG_writeback before submitting the IO.  Otherwise, if the IO is
        very fast or synchronous, the completion handler will go BUG when it
        sees a non-PageWriteback page.
    • [PATCH] Fix the boot-time reporting of each zone's available pages · d98b1feb
      Andrew Morton authored
      Patch from Bjorn Helgaas, via Rusty.
      
      Change:
      
        On node 0 totalpages: 61031         <--- not including holes
        zone(0): 65172 pages.               <--- including holes
        zone(1): 0 pages.                   ...
        zone(2): 0 pages.
      
      to:
      
        On node 0 totalpages: 61031         <--- not including holes
        DMA zone: 61031 pages               <--- not including holes
        Normal zone: 0 pages
        HighMem zone: 0 pages
    • [PATCH] shared thread signals · 6dfc8897
      Ingo Molnar authored
      Support POSIX-compliant thread signals at the kernel level, with usable
      debugging (broadcast SIGSTOP, SIGCONT) and thread group management
      (broadcast SIGKILL), and load-balance 'process' signals between threads
      for better signal performance.
      
      Changes:
      
      - POSIX thread semantics for signals
      
      There are 7 'types' of actions a signal can take: specific, load-balance,
      kill-all, kill-all+core, stop-all, continue-all and ignore.  Depending on
      the POSIX specifications, each signal has one of these types defined for
      both the 'handler defined' and the 'handler not defined (kernel default)'
      cases.  Here is the table:
      
       ----------------------------------------------------------
       |                    |  userspace       |  kernel        |
       ----------------------------------------------------------
       |  SIGHUP            |  load-balance    |  kill-all      |
       |  SIGINT            |  load-balance    |  kill-all      |
       |  SIGQUIT           |  load-balance    |  kill-all+core |
       |  SIGILL            |  specific        |  kill-all+core |
       |  SIGTRAP           |  specific        |  kill-all+core |
       |  SIGABRT/SIGIOT    |  specific        |  kill-all+core |
       |  SIGBUS            |  specific        |  kill-all+core |
       |  SIGFPE            |  specific        |  kill-all+core |
       |  SIGKILL           |  n/a             |  kill-all      |
       |  SIGUSR1           |  load-balance    |  kill-all      |
       |  SIGSEGV           |  specific        |  kill-all+core |
       |  SIGUSR2           |  load-balance    |  kill-all      |
       |  SIGPIPE           |  specific        |  kill-all      |
       |  SIGALRM           |  load-balance    |  kill-all      |
       |  SIGTERM           |  load-balance    |  kill-all      |
       |  SIGCHLD           |  load-balance    |  ignore        |
       |  SIGCONT           |  load-balance    |  continue-all  |
       |  SIGSTOP           |  n/a             |  stop-all      |
       |  SIGTSTP           |  load-balance    |  stop-all      |
       |  SIGTTIN           |  load-balance    |  stop-all      |
       |  SIGTTOU           |  load-balance    |  stop-all      |
       |  SIGURG            |  load-balance    |  ignore        |
       |  SIGXCPU           |  specific        |  kill-all+core |
       |  SIGXFSZ           |  specific        |  kill-all+core |
       |  SIGVTALRM         |  load-balance    |  kill-all      |
       |  SIGPROF           |  specific        |  kill-all      |
       |  SIGPOLL/SIGIO     |  load-balance    |  kill-all      |
       |  SIGSYS/SIGUNUSED  |  specific        |  kill-all+core |
       |  SIGSTKFLT         |  specific        |  kill-all      |
       |  SIGWINCH          |  load-balance    |  ignore        |
       |  SIGPWR            |  load-balance    |  kill-all      |
       |  SIGRTMIN-SIGRTMAX |  load-balance    |  kill-all      |
       ----------------------------------------------------------
      
      As you can see from the table, signals that have handlers defined never get
      broadcast - they are either specific or load-balanced.
      
      - CLONE_THREAD implies CLONE_SIGHAND
      
      It does not make much sense to have a thread group that does not share
      signal handlers.  In fact, in the patch I'm using the signal spinlock to
      lock access to the thread group.  I made the siglock IRQ-safe, so we can
      load-balance signals from interrupt contexts as well.  (We cannot take the
      tasklist lock in write mode from IRQ handlers.)
      
      This is not as clean as I'd like it to be, but it's the best I could come
      up with so far.
      
      - thread group list management reworked.
      
      Threads are now removed from the group when the thread is unhashed from the
      PID table.  This makes the most sense.  It also helps with another feature
      that relies on an intact thread group list: multithreaded coredumps.
      
      - child reparenting reworked.
      
      The O(N) algorithm in forget_original_parent() causes massive performance
      problems if a large number of threads exit from the group. Performance 
      improves more than 10-fold if the following simple rules are followed 
      instead:
      
       - reparent children to the *previous* thread [exiting or not]
       - if a thread is detached then reparent to init.
      
      - fast broadcasting of kernel-internal SIGSTOP, SIGCONT, SIGKILL, etc.
      
      Kernel-internal broadcast signals are a potential DoS problem, since they
      might generate massive numbers of GFP_ATOMIC allocations of siginfo
      structures.  The important thing to note is that the siginfo structure does
      not actually have to be allocated and queued: the signal processing code
      has all the information it needs, and none of these signals carries any
      information in the siginfo structure.  This makes a broadcast SIGKILL a
      very simple operation: all threads get bit 9 set in their pending
      bitmask.  The speedup due to this was significant - and the robustness win
      is invaluable.
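
      A hedged sketch of that fast path with toy data structures (the kernel's
      real task and signal structures are not shown): no siginfo is allocated,
      each thread just gets the bit set in its pending mask.

      	#include <signal.h>

      	struct thread_sketch {
      		sigset_t pending;		/* per-thread pending bitmask */
      		struct thread_sketch *next;	/* thread-group list */
      	};

      	static void broadcast_kernel_signal(struct thread_sketch *group, int sig)
      	{
      		struct thread_sketch *t;

      		for (t = group; t; t = t->next)
      			sigaddset(&t->pending, sig);	/* e.g. SIGKILL: set bit 9 */
      	}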
      
      - sys_execve() should not kill off 'all other' threads.
      
      The 'exec() kills all other threads if the master thread does the exec()'
      rule is a POSIX(-ish) thing that should not be hardcoded in the kernel in
      this case.

      To handle POSIX exec() semantics, glibc uses a special syscall which kills
      'all but self' threads: sys_exit_allbutself().

      The straightforward exec() implementation just calls sys_exit_allbutself()
      and then sys_execve().

      (This syscall is also used internally if the thread group leader thread
      sys_exit()s or sys_exec()s, to ensure the integrity of the thread group.)
    • [PATCH] pci bus resources, transparent bridges · 36780249
      Ivan Kokshaysky authored
      Added PCI_BUS_NUM_RESOURCES as Ben suggested.  The default value is 4
      and can be overridden by the arch (probably in asm/system.h).
      pci_read_bridge_bases() and pci_assign_bus_resource() changed
      accordingly.  The "for (i = 0 ; i < 4; i++)" loop in pci_add_new_bus() is
      not changed, as it's used _only_ for PCI-PCI and CardBus bridges.
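
      Presumably the override follows the usual convention, along these lines
      (a sketch of the idea, not the exact header change):

      	#ifndef PCI_BUS_NUM_RESOURCES		/* arch may have defined it already */
      	#define PCI_BUS_NUM_RESOURCES	4	/* default number of bus resources */
      	#endif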