1. 21 Oct, 2008 3 commits
    • Mikulas Patocka's avatar
      dm snapshot: fix primary_pe race · 7c5f78b9
      Mikulas Patocka authored
      Fix a race condition with primary_pe ref_count handling.
      
      put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test
      on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count.
      
      __origin_write does atomic_dec_and_test on primary_pe->ref_count without holding
      dm_snapshot->lock.
      
      This opens the following race condition:
      Assume two CPUs, CPU1 is executing put_pending_exception (and holding
      dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
      primary_pe->ref_count == 2.
      
      CPU1:
      if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
      	origin_bios = bio_list_get(&primary_pe->origin_bios);
      ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios
      
      CPU2:
      if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
      	flush_bios(bio_list_get(&primary_pe->origin_bios));
      	free_pending_exception(primary_pe);
      	/* If we got here, pe_queue is necessarily empty. */
      	return r;
      }
      ... decrements primary_pe->ref_count to 0, submits pending bios, frees
      primary_pe.
      
      CPU1:
      if (!primary_pe || primary_pe != pe)
      	free_pending_exception(pe);
      ... this has no effect.
      if (primary_pe && !atomic_read(&primary_pe->ref_count))
      	free_pending_exception(primary_pe);
      ... sees ref_count == 0 (written by CPU 2), does double free !!
      
      This bug can happen only if someone is simultaneously writing to both the
      origin and the snapshot.
      
      If someone is writing only to the origin, __origin_write will submit kcopyd
      request after it decrements primary_pe->ref_count (so it can't happen that the
      finished copy races with primary_pe->ref_count decrementation).
      
      If someone is writing only to the snapshot, __origin_write isn't invoked at all
      and the race can't happen.
      
      The race happens when someone writes to the snapshot --- this creates
      pending_exception with primary_pe == NULL and starts copying. Then, someone
      writes to the same chunk in the snapshot, and __origin_write races with
      termination of already submitted request in pending_complete (that calls
      put_pending_exception).
      
      This race may be reason for bugs:
        http://bugzilla.kernel.org/show_bug.cgi?id=11636
        https://bugzilla.redhat.com/show_bug.cgi?id=465825
      
      The patch fixes the code to make sure that:
      1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
      must no longer dereference primary_pe (because someone else may free it under
      us).
      2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
      is responsible for freeing primary_pe.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      7c5f78b9
    • Kazuo Ito's avatar
      dm kcopyd: avoid queue shuffle · b673c3a8
      Kazuo Ito authored
      Write throughput to LVM snapshot origin volume is an order
      of magnitude slower than those to LV without snapshots or
      snapshot target volumes, especially in the case of sequential
      writes with O_SYNC on.
      
      The following patch originally written by Kevin Jamieson and
      Jan Blunck and slightly modified for the current RCs by myself
      tries to improve the performance by modifying the behaviour
      of kcopyd, so that it pushes back an I/O job to the head of
      the job queue instead of the tail as process_jobs() currently
      does when it has to wait for free pages. This way, write
      requests aren't shuffled to cause extra seeks.
      
      I tested the patch against 2.6.27-rc5 and got the following results.
      The test is a dd command writing to snapshot origin followed by fsync
      to the file just created/updated.  A couple of filesystem benchmarks
      gave me similar results in case of sequential writes, while random
      writes didn't suffer much.
      
      dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=...
         [conv=notrunc when updating]
      
      1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
      average throughput (MB/s)
                           10M     100M    1000M
      create,dd         511.46   610.72    11.81
      create,dd+fsync     7.10     6.77     8.13
      update,dd         431.63   917.41    12.75
      update,dd+fsync     7.79     7.43     8.12
      
      compared with write throughput to LV without any snapshots,
      all dd+fsync and 1000 MiB writes perform very poorly.
      
                           10M     100M    1000M
      create,dd         555.03   608.98   123.29
      create,dd+fsync   114.27    72.78    76.65
      update,dd         152.34  1267.27   124.04
      update,dd+fsync   130.56    77.81    77.84
      
      2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
      average throughput (MB/s)
      
                           10M     100M    1000M
      create,dd         537.06   589.44    46.21
      create,dd+fsync    31.63    29.19    29.23
      update,dd         487.59   897.65    37.76
      update,dd+fsync    34.12    30.07    26.85
      
      Although still not on par with plain LV performance -
      cannot be avoided because it's copy on write anyway -
      this simple patch successfully improves throughtput
      of dd+fsync while not affecting the rest.
      Signed-off-by: default avatarJan Blunck <jblunck@suse.de>
      Signed-off-by: default avatarKazuo Ito <ito.kazuo@oss.ntt.co.jp>
      Signed-off-by: default avatarAlasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      b673c3a8
    • Paul Mundt's avatar
      binfmt_elf_fdpic: Update for cputime changes. · 2515ddc6
      Paul Mundt authored
      Commit f06febc9 ("timers: fix itimer/
      many thread hang") introduced a new task_cputime interface and
      subsequently only converted binfmt_elf over to it.  This results in the
      build for binfmt_elf_fdpic blowing up given that p->signal->{u,s}time
      have disappeared from underneath us.
      
      Apply the same trivial fix from binfmt_elf to binfmt_elf_fdpic.
      Signed-off-by: default avatarPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2515ddc6
  2. 20 Oct, 2008 37 commits