1. 03 Sep, 2008 14 commits
    • KOSAKI Motohiro's avatar
      mm: size of quicklists shouldn't be proportional to the number of CPUs · b9541852
      KOSAKI Motohiro authored
      Quicklists store pages for each CPU as caches.  (Each CPU can cache
      node_free_pages/16 pages)
      
      It is used for page table cache.  exit() will increase the cache size,
      while fork() consumes it.
      
      So for example if an apache-style application runs (one parent and many
      child model), one CPU process will fork() while another CPU will process
      the middleware work and exit().
      
      At that time, the CPU on which the parent runs doesn't have page table
      cache at all.  Others (on which children runs) have maximum caches.
      
      	QList_max = (#ofCPUs - 1) x Free / 16
      	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)
      
      So, How much quicklist memory is used in the maximum case?
      
      This is proposional to # of CPUs because the limit of per cpu quicklist
      cache doesn't see the number of cpus.
      
      Above calculation mean
      
      	 Number of CPUs per node            2    4    8   16
      	 ==============================  ====================
      	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%
      
      Wow! Quicklist can spend about 50% memory at worst case.
      
      My demonstration program is here
      --------------------------------------------------------------------------------
      #define _GNU_SOURCE
      
      #include <stdio.h>
      #include <errno.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sched.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      
      #define BUFFSIZE 512
      
      int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
      {
        FILE *fd;
        char *ret, buffer[BUFFSIZE];
        int cpu = 1;
      
        fd = fopen("/proc/cpuinfo", "r");
        if (fd == NULL) {
          perror("fopen(/proc/cpuinfo)");
          exit(EXIT_FAILURE);
        }
        while (1) {
          ret = fgets(buffer, BUFFSIZE, fd);
          if (ret == NULL)
            break;
          if (!strncmp(buffer, "processor", 9))
            cpu = atoi(strchr(buffer, ':') + 2);
        }
        fclose(fd);
        return cpu;
      }
      
      void cpu_bind(int cpu)	/* bind current process to one cpu */
      {
        cpu_set_t mask;
        int ret;
      
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        ret = sched_setaffinity(0, sizeof(mask), &mask);
        if (ret == -1) {
          perror("sched_setaffinity()");
          exit(EXIT_FAILURE);
        }
        sched_yield();	/* not necessary */
      }
      
      #define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
      #define FORK_INTERVAL 1	/* 1 second */
      
      main(int argc, char *argv[])
      {
        int cpu_max, nextcpu;
        long pagesize;
        pid_t pid;
      
        /* set max number of logical cpu */
        if (argc > 1)
          cpu_max = atoi(argv[1]) - 1;
        else
          cpu_max = max_cpu();
      
        /* get the page size */
        pagesize = sysconf(_SC_PAGESIZE);
        if (pagesize == -1) {
          perror("sysconf(_SC_PAGESIZE)");
          exit(EXIT_FAILURE);
        }
      
        /* prepare parent process */
        cpu_bind(0);
        nextcpu = cpu_max;
      
      loop:
      
        /* select destination cpu for child process by round-robin rule */
        if (++nextcpu > cpu_max)
          nextcpu = 1;
      
        pid = fork();
      
        if (pid == 0) { /* child action */
      
          char *p;
          int i;
      
          /* consume page tables */
          p = mmap(0, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
          i = MMAP_SIZE / pagesize;
          while (i-- > 0) {
            *p = 1;
            p += pagesize;
          }
      
          /* move to other cpu */
          cpu_bind(nextcpu);
      /*
          printf("a child moved to cpu%d after mmap().\n", nextcpu);
          fflush(stdout);
       */
      
          /* back page tables to pgtable_quicklist */
          exit(0);
      
        } else if (pid > 0) { /* parent action */
      
          sleep(FORK_INTERVAL);
          waitpid(pid, NULL, WNOHANG);
      
        }
      
        goto loop;
      }
      ----------------------------------------
      
      When above program which does task migration runs, my 8GB box spends
      800MB of memory for quicklist.  This is not memory leak but doesn't seem
      good.
      
      % cat /proc/meminfo
      
      MemTotal:        7701568 kB
      MemFree:         4724672 kB
      (snip)
      Quicklists:       844800 kB
      
      because
      
      - My machine spec is
      	number of numa node: 2
      	number of cpus:      8 (4CPU x2 node)
              total mem:           8GB (4GB x2 node)
              free mem:            about 5GB
      
      - Then, 4.7GB x 16% ~= 880MB.
        So, Quicklist can use 800MB.
      
      So, if following spec machine run that program
      
         CPUs: 64 (8cpu x 8node)
         Mem:  1TB (128GB x8node)
      
      Then, quicklist can waste 300GB (= 1TB x 30%).  It is too large.
      
      So, I don't like cache policies which is proportional to # of cpus.
      
      My patch changes the number of caches
      from:
         per-cpu-cache-amount = memory_on_node / 16
      to
         per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Tested-by: default avatarDavid Miller <davem@davemloft.net>
      Acked-by: default avatarMike Travis <travis@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9541852
    • KOSAKI Motohiro's avatar
      mm: show quicklist usage in /proc/meminfo · 4b856152
      KOSAKI Motohiro authored
      Quicklists can consume several GB of memory.  We should provide a means of
      monitoring this.
      
      After this patch is applied, /proc/meminfo will output the following:
      
      % cat /proc/meminfo
      
      MemTotal:      7715392 kB
      MemFree:       5401600 kB
      Buffers:         80384 kB
      Cached:         300800 kB
      SwapCached:          0 kB
      Active:         235584 kB
      Inactive:       262656 kB
      SwapTotal:     2031488 kB
      SwapFree:      2031488 kB
      Dirty:            3520 kB
      Writeback:           0 kB
      AnonPages:      117696 kB
      Mapped:          38528 kB
      Slab:          1589952 kB
      SReclaimable:    23104 kB
      SUnreclaim:    1566848 kB
      PageTables:      14656 kB
      NFS_Unstable:        0 kB
      Bounce:              0 kB
      WritebackTmp:        0 kB
      CommitLimit:   5889152 kB
      Committed_AS:   393152 kB
      VmallocTotal: 17592177655808 kB
      VmallocUsed:     29056 kB
      VmallocChunk: 17592177626432 kB
      Quicklists:     130944 kB
      HugePages_Total:     0
      HugePages_Free:      0
      HugePages_Rsvd:      0
      HugePages_Surp:      0
      Hugepagesize:    262144 kB
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b856152
    • Li Zefan's avatar
      devcgroup: fix race against rmdir() · 36fd71d2
      Li Zefan authored
      During the use of a dev_cgroup, we should guarantee the corresponding
      cgroup won't be deleted (i.e.  via rmdir).  This can be done through
      css_get(&dev_cgroup->css), but here we can just get and use the dev_cgroup
      under rcu_read_lock.
      
      And also remove checking NULL dev_cgroup, it won't be NULL since a task
      always belongs to a cgroup.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36fd71d2
    • Krzysztof Helt's avatar
      cirrusfb: check_par fixes · 09a2910e
      Krzysztof Helt authored
      1. Check if virtual resolution fits into memory.
         Otherwise, Linux hangs during panning.
      2. When selected use all available memory to
          maximize yres_virtual to speed up panning
         (previously also xres_virtual was increased).
      3. Simplify memory restriction calculations.
      Signed-off-by: default avatarKrzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09a2910e
    • Oleg Nesterov's avatar
      pid_ns: (BUG 11391) change ->child_reaper when init->group_leader exits · 950bbabb
      Oleg Nesterov authored
      We don't change pid_ns->child_reaper when the main thread of the
      subnamespace init exits.  As Robert Rex <robert.rex@exasol.com> pointed
      out this is wrong.
      
      Yes, the re-parenting itself works correctly, but if the reparented task
      exits it needs ->parent->nsproxy->pid_ns in do_notify_parent(), and if the
      main thread is zombie its ->nsproxy was already cleared by
      exit_task_namespaces().
      
      Introduce the new function, find_new_reaper(), which finds the new
      ->parent for the re-parenting and changes ->child_reaper if needed.  Kill
      the now unneeded exit_child_reaper().
      
      Also move the changing of ->child_reaper from zap_pid_ns_processes() to
      find_new_reaper(), this consolidates the games with ->child_reaper and
      makes it stable under tasklist_lock.
      
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=11391Reported-by: default avatarRobert Rex <robert.rex@exasol.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Acked-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Acked-by: default avatarSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      950bbabb
    • Oleg Nesterov's avatar
      pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing · add0d4df
      Oleg Nesterov authored
      zap_pid_ns_processes() sets pid_ns->child_reaper = NULL, this is wrong.
      
      Yes, we have already killed all tasks in this namespace, and sys_wait4()
      doesn't see any child.  But this doesn't mean ->children list is empty, we
      may have EXIT_DEAD tasks which are not visible to do_wait().  In that case
      the subsequent forget_original_parent() will crash the kernel because it
      will try to re-parent these tasks to the NULL reaper.
      
      Even if there are no childs, it is not good that forget_original_parent()
      uses reaper == NULL.
      
      Change the code to set ->child_reaper = init_pid_ns.child_reaper instead.
      We could use pid_ns->parent->child_reaper as well, I think this does not
      really matter.  These EXIT_DEAD tasks are not visible to the new ->parent
      after re-parenting, they will silently do release_task() eventually.
      
      Note that we must change ->child_reaper, otherwise
      forget_original_parent() will use reaper == father, and in that case we
      will hit the (correct) BUG_ON(!list_empty(&father->children)).
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Acked-by: default avatarSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Acked-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      add0d4df
    • David Brownell's avatar
      mmc: at91_mci: don't use coherent dma buffers · e385ea63
      David Brownell authored
      At91_mci is abusing dma_free_coherent(), which may not be called with IRQs
      disabled.  I saw "mkfs.ext3" on an MMC card objecting voluminously as each
      write completed:
      
       WARNING: at arch/arm/mm/consistent.c:368 dma_free_coherent+0x2c/0x224()
       [<c002726c>] (dump_stack+0x0/0x14) from [<c00387d4>] (warn_on_slowpath+0x4c/0x68)
       [<c0038788>] (warn_on_slowpath+0x0/0x68) from [<c0028768>] (dma_free_coherent+0x2c/0x224)
        r6:00008008 r5:ffc06000 r4:00000000
       [<c002873c>] (dma_free_coherent+0x0/0x224) from [<c01918ac>] (at91_mci_irq+0x374/0x420)
       [<c0191538>] (at91_mci_irq+0x0/0x420) from [<c0065d9c>] (handle_IRQ_event+0x2c/0x6c)
       ...
      
      This bug has been around for a LONG time.  The MM warning is from late
      2005, but the driver merged a year later ...  so I'm puzzled why nobody
      noticed this before now.
      
      The fix involves noting that this buffer shouldn't be DMA-coherent; it's
      just used for normal DMA writes.  So replace it with standard kmalloc()
      buffering and DMA mapping calls.
      
      This is the quickie fix.  A better one would not rely on allocating large
      bounce buffers.  (Note that dma_alloc_coherent could have failed too, but
      that case was ignored...  kmalloc is a bit more likely to fail though.)
      Signed-off-by: default avatarDavid Brownell <dbrownell@users.sourceforge.net>
      Acked-by: default avatarPierre Ossman <drzeus-mmc@drzeus.cx>
      Cc: Andrew Victor <linux@maxim.org.za>
      Acked-by: default avatarNicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e385ea63
    • Will Newton's avatar
      8250: improve workaround for UARTs that don't re-assert THRE correctly · 363f66fe
      Will Newton authored
      Recent changes to tighten the check for UARTs that don't correctly
      re-assert THRE (01c194d9: "serial 8250:
      tighten test for using backup timer") caused problems when such a UART was
      opened for the second time - the bug could only successfully be detected
      at first initialization.  For users of this version of this particular
      UART IP it is fatal.
      
      This patch stores the information about the bug in the bugs field of the
      port structure when the port is first started up so subsequent opens can
      check this bit even if the test for the bug fails.
      
      David Brownell: "My own exposure to this is that the UART on DaVinci
      hardware, which TI allegedly derived from its original 16550 logic, has
      periodically gone from working to unusable with the mainline 8250.c ...
      and back and forth a bunch.  Currently it's "unusable", a regression from
      some previous versions.  With this patch from Will, it's usable."
      Signed-off-by: default avatarWill Newton <will.newton@gmail.com>
      Acked-by: default avatarAlex Williamson <alex.williamson@hp.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: <stable@kernel.org>		[2.6.26.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      363f66fe
    • Henrik Rydberg's avatar
      MAINTAINERS: add a maintainer for the BCM5974 multitouch driver · bd7aa4b2
      Henrik Rydberg authored
      Signed-off-by: default avatarHenrik Rydberg <rydberg@euromail.se>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bd7aa4b2
    • Marcin Slusarz's avatar
      mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data · 52765583
      Marcin Slusarz authored
      WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
      The variable contig_page_data references
      the variable __initdata bootmem_node_data
      If the reference is valid then annotate the
      variable with __init* (see linux/init.h) or name the variable:
      *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
      Signed-off-by: default avatarMarcin Slusarz <marcin.slusarz@gmail.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Sean MacLennan <smaclennan@pikatech.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52765583
    • Russ Dill's avatar
      acer-wmi: remove debugfs entries upon unloading · 39dbbb45
      Russ Dill authored
      The exit function neglects to remove debugfs entries, leading to a BUG
      on reload.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: default avatarRuss Dill <Russ.Dill@gmail.com>
      Acked-by: default avatarCarlos Corbacho <carlos@strangeworlds.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39dbbb45
    • Hisashi Hifumi's avatar
      VFS: fix dio write returning EIO when try_to_release_page fails · 6ccfa806
      Hisashi Hifumi authored
      Dio write returns EIO when try_to_release_page fails because bh is
      still referenced.
      
      The patch
      
          commit 3f31fddf
          Author: Mingming Cao <cmm@us.ibm.com>
          Date:   Fri Jul 25 01:46:22 2008 -0700
      
              jbd: fix race between free buffer and commit transaction
      
      was merged into 2.6.27-rc1, but I noticed that this patch is not enough
      to fix the race.
      
      I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
      sometimes got EIO through this test.
      
      The patch above fixed race between freeing buffer(dio) and committing
      transaction(jbd) but I discovered that there is another race, freeing
      buffer(dio) and ext3/4_ordered_writepage.
      
      : background_writeout()
           ->write_cache_pages()
             ->ext3_ordered_writepage()
           	   walk_page_buffers() -> take a bh ref
       	   block_write_full_page() -> unlock_page
      		: <- end_page_writeback
                      : <- race! (dio write->try_to_release_page fails)
            	   walk_page_buffers() ->release a bh ref
      
      ext3_ordered_writepage holds bh ref and does unlock_page remaining
      taking a bh ref, so this causes the race and failure of
      try_to_release_page.
      
      To fix this race, I used the approach of falling back to buffered
      writes if try_to_release_page() fails on a page.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: default avatarHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6ccfa806
    • Adam Litke's avatar
      mm: make setup_zone_migrate_reserve() aware of overlapping nodes · 344c790e
      Adam Litke authored
      I have gotten to the root cause of the hugetlb badness I reported back on
      August 15th.  My system has the following memory topology (note the
      overlapping node):
      
                  Node 0 Memory: 0x8000000-0x44000000
                  Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000
      
      setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
      for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
      candidates, it happily continues the scan into 0x8000000-0x44000000.  When
      a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
      the wrong zone.  Oops.
      
      setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      344c790e
    • Adrian Bunk's avatar
      NTFS: update homepage · 169ccbd4
      Adrian Bunk authored
      Update the location of the NTFS homepage in several files.
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Cc: Jeff Garzik <jeff@garzik.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      169ccbd4
  2. 02 Sep, 2008 26 commits