1. 29 Oct, 2008 6 commits
    • Btrfs: Add root tree pointer transaction ids · 84234f3a
      Yan Zheng authored
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik authored
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition.  I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically, instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also, to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent, so that if we cannot lock the extent we move on to the next
      one in the tree and come back to that one later.  I have tested this heavily
      and it does not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
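
      A rough sketch of the try-lock-and-skip idea, under the assumption
      that each extent carries a simple "locked" flag.  The names ending in
      _sketch and the struct layout are made up for illustration; the real
      patch works on extent io trees, not arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical extent record; not the real btrfs representation. */
struct extent_sketch {
    bool locked;      /* currently held by someone else */
    bool processed;   /* work on this extent is done */
};

/* Take the lock only if it is free; never wait. */
static bool try_lock_extent_sketch(struct extent_sketch *e)
{
    if (e->locked)
        return false;
    e->locked = true;
    return true;
}

static void unlock_extent_sketch(struct extent_sketch *e)
{
    e->locked = false;
}

/*
 * One pass over the pending extents.  Extents that cannot be locked are
 * skipped and left for a later pass.  Returns how many were skipped.
 */
static size_t process_pass_sketch(struct extent_sketch *ext, size_t n)
{
    size_t skipped = 0;

    assert(ext != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ext[i].processed)
            continue;
        if (!try_lock_extent_sketch(&ext[i])) {
            skipped++;            /* busy: move on, come back later */
            continue;
        }
        ext[i].processed = true;  /* stand-in for the real work */
        unlock_extent_sketch(&ext[i]);
    }
    return skipped;
}
```

      Calling process_pass_sketch() again once it returns non-zero picks up
      the extents that were busy the first time around.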
      
      I tested this patch on top of Yan's space rebalancing code and it worked fine.
      The only thing that has changed since the last version is that I pulled out
      all my debugging stuff; apparently I forgot to run guilt refresh before I sent
      the last patch out.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: fix enospc when there is plenty of space · 80eb234a
      Josef Bacik authored
      So there is an odd case where we can possibly return -ENOSPC when there is in
      fact space to be had.  It only happens with metadata writes, and happens _very_
      infrequently.  What has to happen is that we have allocated out of
      the first logical byte on the disk, which would set last_alloc to
      first_logical_byte(root, 0), so search_start == orig_search_start.  We then
      need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
      BTRFS_BLOCK_GROUP_DUP.  We will do a block lookup for the given search_start,
      block_group_bits() won't match, and we'll go to choose another block group.
      However, because search_start matches orig_search_start, we go to see if we can
      allocate a chunk.
      
      If we are in the situation that we cannot allocate a chunk, we fail with -ENOSPC.
      This is kind of a big flaw of the way find_free_extent works, as it, along with
      find_free_space, loops through _all_ of the block groups, not just the ones that
      we want to allocate out of.  This patch completely kills find_free_space and
      rolls it into find_free_extent.  I've introduced a sort of state machine into
      this, which will make it easier to get cache miss information out of the
      allocator, and will work well with my locking changes.
      
      The basic flow is this:  We have the variable loop, which starts at 0, meaning
      we are in the hint phase.  We look up the block group for the hint, and look up
      the space_info for what we want to allocate out of.  If the block group we were
      pointed at by the hint either isn't of the correct type, or just doesn't have
      the space we need, we set head to space_info->block_groups, so we start at the
      beginning of the block groups for this particular space info, and loop through.
      
      This is also where we add the empty_cluster to total_needed.  At this point
      loop is set to 1 and we just loop through all of the block groups for this
      particular space_info looking for the space we need, just as find_free_space
      would have done, except we only hit the block groups we want and not _all_ of
      the block groups.  If we come full circle we see if we can allocate a chunk.
      If we cannot, we exit with -ENOSPC.  Otherwise we start
      over at space_info->block_groups and loop through again, with loop == 2.  If we
      come full circle and haven't found what we need then we exit with -ENOSPC.
      I've been running this for a couple of days now and it seems stable, and I
      haven't yet hit a -ENOSPC when there was plenty of space left.
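
      The flow above can be sketched roughly as follows.  This is a toy
      model, not the kernel code: the structs, MAX_GROUPS_SKETCH,
      alloc_chunk_sketch() and the spare_chunks counter are all invented for
      illustration, and the empty_cluster adjustment is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS_SKETCH 8

/* Hypothetical, simplified block group and space_info. */
struct block_group_sketch {
    uint64_t flags;        /* allocation type, e.g. metadata or data */
    uint64_t free_bytes;
};

struct space_info_sketch {
    uint64_t flags;
    struct block_group_sketch groups[MAX_GROUPS_SKETCH]; /* this type only */
    size_t nr_groups;
    unsigned spare_chunks; /* chunks that could still be allocated */
    uint64_t chunk_bytes;  /* free space a new chunk would provide */
};

/* Add a new block group to this space_info if one can still be allocated. */
static int alloc_chunk_sketch(struct space_info_sketch *si)
{
    if (si->spare_chunks == 0 || si->nr_groups == MAX_GROUPS_SKETCH)
        return -ENOSPC;
    si->groups[si->nr_groups].flags = si->flags;
    si->groups[si->nr_groups].free_bytes = si->chunk_bytes;
    si->nr_groups++;
    si->spare_chunks--;
    return 0;
}

static int find_free_extent_sketch(struct space_info_sketch *si,
                                   struct block_group_sketch *hint,
                                   uint64_t num_bytes,
                                   struct block_group_sketch **found)
{
    assert(si != NULL && found != NULL);

    /* loop == 0: hint phase.  Use the hint only if type and space fit. */
    if (hint && hint->flags == si->flags && hint->free_bytes >= num_bytes) {
        *found = hint;
        return 0;
    }

    /* loop == 1 and loop == 2: walk only this space_info's block groups. */
    for (int loop = 1; loop <= 2; loop++) {
        for (size_t i = 0; i < si->nr_groups; i++) {
            if (si->groups[i].free_bytes >= num_bytes) {
                *found = &si->groups[i];
                return 0;
            }
        }
        /* Came full circle: on the first pass, try to allocate a chunk. */
        if (loop == 1 && alloc_chunk_sketch(si) != 0)
            return -ENOSPC;
    }
    return -ENOSPC;
}
```

      The point of the sketch is that a hint of the wrong type no longer
      leads straight to chunk allocation: the block groups of the right
      type are always scanned first.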
      
      Also I've made a groups_sem to handle the group list for the space_info.  This
      is part of my locking changes, but is relatively safe and seems better than
      holding the space_info spinlock over that entire search time.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: Improve space balancing code · f82d02d9
      Yan Zheng authored
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      when data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in the old code
      that the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Add zlib compression support · c8b97818
      Chris Mason authored
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software-only limit; the disk format supports u64 sized compressed extents.
      
      In order to limit the RAM consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software-only limit
      and will be subject to tuning later.
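
      A small sketch of how the two size limits and the "stop trying if it
      does not shrink" rule from earlier in this message could fit together.
      The constant and function names are invented here, and the handling of
      output that shrinks but still exceeds 128k (rejected without flagging
      the file) is an assumption, not something this message specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_COMPRESSED_SKETCH   (128 * 1024)  /* on-disk cap per extent */
#define MAX_UNCOMPRESSED_SKETCH (256 * 1024)  /* in-memory cap per extent */

/* Hypothetical per-file state. */
struct inode_sketch {
    bool nocompress;   /* set once compression failed to shrink the data */
};

/* Bytes of a dirty range that may go into one compressed extent. */
static size_t clamp_uncompressed_sketch(size_t len)
{
    return len < MAX_UNCOMPRESSED_SKETCH ? len : MAX_UNCOMPRESSED_SKETCH;
}

/*
 * Decide whether to store the compressed form.  compressed_len is what
 * the compressor produced for raw_len bytes of input.
 */
static bool use_compressed_sketch(struct inode_sketch *inode,
                                  size_t raw_len, size_t compressed_len)
{
    assert(inode != NULL && raw_len <= MAX_UNCOMPRESSED_SKETCH);

    if (inode->nocompress)
        return false;                 /* earlier attempt did not help */
    if (compressed_len >= raw_len) {
        inode->nocompress = true;     /* avoid future attempts */
        return false;
    }
    /* Assumption: oversized output is rejected without flagging the file. */
    return compressed_len <= MAX_COMPRESSED_SKETCH;
}
```

      A writer would first clamp the range with clamp_uncompressed_sketch(),
      run the compressor, and then ask use_compressed_sketch() whether to
      keep the result or fall back to writing the data uncompressed.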
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the CPUs on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  2. 16 Oct, 2008 1 commit
  3. 10 Oct, 2008 2 commits
    • Btrfs: make tree_search_offset more flexible in its searching · 37d3cddd
      Josef Bacik authored
      Sometimes we end up freeing a reserved extent because we don't need it.  However,
      this means that it's possible for transaction->last_alloc to point to the middle
      of a free area.
      
      When we search for free space in find_free_space we do a tree_search_offset
      with contains set to 0, because we want it to find the next best free area if
      we do not have an offset starting on the given offset.
      
      Unfortunately that currently means that if the offset we were given as a hint
      points to the middle of a free area, we won't find anything.  This is especially
      bad if we happened to last allocate from the big huge chunk of a newly formed
      block group, since we won't find anything and have to go back and search the
      long way around.
      
      This fixes the problem by making it so that we return the free space area
      regardless of the contains variable.  This made cache misses happen _a lot_
      less, and speeds things up considerably.
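
      A toy illustration of the changed lookup, assuming the free areas are
      kept in a sorted, non-overlapping array.  The real code searches an
      rbtree; the struct and function names here are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record: [offset, offset + bytes). */
struct free_space_sketch {
    uint64_t offset;
    uint64_t bytes;
};

/*
 * Return the free area that contains 'offset', or failing that the next
 * free area after it.  With sorted, non-overlapping areas, the first area
 * whose end lies beyond 'offset' is exactly that one.
 */
static const struct free_space_sketch *
search_offset_sketch(const struct free_space_sketch *fs, size_t n,
                     uint64_t offset)
{
    assert(fs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (fs[i].offset + fs[i].bytes > offset)
            return &fs[i];
    }
    return NULL;   /* nothing at or after the hint */
}
```

      With areas at 0-100, 200-1200 and 5000-5050, a hint of 600 lands in
      the middle of the second area and now returns it, instead of skipping
      ahead to the third.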
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: Don't call security_inode_mkdir during subvol creation · a3dddf3f
      Chris Mason authored
      Subvol creation already requires privs, and security_inode_mkdir isn't
      exported.  For now we don't need it.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 09 Oct, 2008 18 commits
  5. 08 Oct, 2008 3 commits
  6. 07 Oct, 2008 7 commits
  7. 06 Oct, 2008 3 commits
    • Linux 2.6.27-rc9 · 4330ed8e
      Linus Torvalds authored
    • Marker depmod fix core kernel list · 87f3b6b6
      Mathieu Desnoyers authored
      * Theodore Ts'o (tytso@mit.edu) wrote:
      >
      > I've been playing with adding some markers into ext4 to see if they
      > could be useful in solving some problems along with Systemtap.  It
      > appears, though, that as of 2.6.27-rc8, markers defined in code which is
      > compiled directly into the kernel (i.e., not as modules) don't show up
      > in Module.markers:
      >
      > kvm_trace_entryexit arch/x86/kvm/kvm-intel  %u %p %u %u %u %u %u %u
      > kvm_trace_handler arch/x86/kvm/kvm-intel  %u %p %u %u %u %u %u %u
      > kvm_trace_entryexit arch/x86/kvm/kvm-amd  %u %p %u %u %u %u %u %u
      > kvm_trace_handler arch/x86/kvm/kvm-amd  %u %p %u %u %u %u %u %u
      >
      > (Note the lack of any of the kernel_sched_* markers, and the markers I
      > added for ext4_* and jbd2_* are missing as well.)
      >
      > Systemtap apparently depends on in-kernel trace_mark being recorded in
      > Module.markers, and apparently it's been claimed that it used to be
      > there.  Is this a bug in systemtap, or in how Module.markers is getting
      > built?   And is there a file that contains the equivalent information
      > for markers located in non-modules code?
      
      I think the problem comes from "markers: fix duplicate modpost entry"
      (commit d35cb360)
      
      Especially :
      
        -   add_marker(mod, marker, fmt);
        +   if (!mod->skip)
        +     add_marker(mod, marker, fmt);
          }
          return;
         fail:
      
      Here is a fix that should take care of this problem.
      
      Thanks for the bug report!
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Tested-by: "Theodore Ts'o" <tytso@mit.edu>
      CC: Greg KH <greg@kroah.com>
      CC: David Smith <dsmith@redhat.com>
      CC: Roland McGrath <roland@redhat.com>
      CC: Sam Ravnborg <sam@ravnborg.org>
      CC: Wenji Huang <wenji.huang@oracle.com>
      CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb · afed26d1
      Linus Torvalds authored
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
        kgdb: call touch_softlockup_watchdog on resume
        kgdb, x86: Avoid invoking kgdb_nmicallback twice per NMI