1. 30 Oct, 2008 5 commits
  2. 29 Oct, 2008 6 commits
    • Btrfs: Add root tree pointer transaction ids · 84234f3a
      Yan Zheng authored
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
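
      A minimal sketch of the idea, assuming the usual btrfs helpers (the
      function shown here is hypothetical, not the patched code itself):

      static int check_root_block(struct extent_buffer *eb, u64 ptr_gen)
      {
              /* The generation stamped into the block header at write time
               * must match the transid recorded in the pointer we followed;
               * a mismatch means the block we read is stale or misdirected. */
              if (btrfs_header_generation(eb) != ptr_gen)
                      return -EIO;
              return 0;
      }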
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

    • Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Josef Bacik authored
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex, which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition; I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically, instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent itself for the duration
      of the operations on it.
      
      Also, to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so that if we cannot lock the extent, we move on to the next
      one in the tree and come back to it later.  I have tested this heavily and it
      does not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
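
      Roughly the shape of that skip-and-revisit loop (a sketch: the helper
      names follow the patch description, and the extent bit is illustrative):

      u64 search = 0, start, end;
      while (!find_first_extent_bit(extent_ins, search, &start, &end,
                                    EXTENT_DIRTY)) {
              /* if someone holds this range locked, don't block the whole
               * transaction; skip ahead and revisit it on a later pass */
              if (!try_lock_extent(extent_ins, start, end, GFP_NOFS)) {
                      search = end + 1;
                      continue;
              }
              /* ... process the extent with only its range locked ... */
              unlock_extent(extent_ins, start, end, GFP_NOFS);
              search = end + 1;
      }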
      
      I tested this patch on top of Yan's space rebalancing code and it worked fine.
      The only thing that has changed since the last version is that I pulled out
      all my debugging stuff; apparently I forgot to run guilt refresh before I sent
      the last patch out.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>

    • Btrfs: fix enospc when there is plenty of space · 80eb234a
      Josef Bacik authored
      So there is an odd case where we can possibly return -ENOSPC when there is in
      fact space to be had.  It only happens with metadata writes, and happens _very_
      infrequently.  What has to happen is we have to have allocated out of
      the first logical byte on the disk, which would set last_alloc to
      first_logical_byte(root, 0), so search_start == orig_search_start.  We then
      need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
      BTRFS_BLOCK_GROUP_DUP.  We will do a block lookup for the given search_start,
      block_group_bits() won't match and we'll go to choose another block group.
      However because search_start matches orig_search_start we go to see if we can
      allocate a chunk.
      
      If we are in the situation that we cannot allocate a chunk, we fail and return -ENOSPC.
      This is kind of a big flaw in the way find_free_extent works, as it, along with
      find_free_space, loops through _all_ of the block groups, not just the ones
      we want to allocate out of.  This patch completely kills find_free_space and
      rolls it into find_free_extent.  I've introduced a sort of state machine into
      this, which will make it easier to get cache miss information out of the
      allocator, and will work well with my locking changes.
      
      The basic flow is this: we have the variable loop, which is 0 while we are
      in the hint phase.  We look up the block group for the hint, and look up the
      space_info for what we want to allocate out of.  If the block group we were
      pointed at by the hint either isn't of the correct type, or just doesn't have
      the space we need, we set head to space_info->block_groups, so we start at the
      beginning of the block groups for this particular space info, and loop through.
      
      This is also where we add the empty_cluster to total_needed.  At this point
      loop is set to 1 and we just loop through all of the block groups for this
      particular space_info looking for the space we need, just as find_free_space
      would have done, except we only hit the block groups we want and not _all_ of
      the block groups.  If we come full circle, we see if we can allocate a chunk.
      If we cannot, of course, we exit with -ENOSPC and we are done.  If we can, we
      start over at space_info->block_groups and loop through again, with loop == 2.
      If we come full circle again and haven't found what we need, we exit with -ENOSPC.
      I've been running this for a couple of days now and it seems stable, and I
      haven't yet hit a -ENOSPC when there was plenty of space left.
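
      A condensed sketch of those phases (the loop variable and the overall
      flow follow the text above; every helper name here is hypothetical):

      int loop = 0;                                   /* 0 = hint phase */
      block_group = lookup_block_group(hint);
      while (!group_matches_and_has_space(block_group, total_needed)) {
              if (loop == 0) {
                      /* hint failed: walk this space_info's own groups */
                      head = &space_info->block_groups;
                      block_group = first_group(head);
                      total_needed += empty_cluster;
                      loop = 1;
                      continue;
              }
              block_group = next_group(head, block_group);
              if (!came_full_circle(head, block_group))
                      continue;
              if (loop == 1 && allowed_chunk_alloc) {
                      if (do_chunk_alloc(trans, root, num_bytes, data))
                              return -ENOSPC;         /* cannot grow a chunk */
                      block_group = first_group(head);
                      loop = 2;                       /* one more full pass */
              } else {
                      return -ENOSPC;                 /* second full circle */
              }
      }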
      
      Also, I've added a groups_sem to handle the group list for the space_info.  This
      is part of my locking changes, but is relatively safe and seems better than
      holding the space_info spinlock over that entire search time.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>

    • Btrfs: Improve space balancing code · f82d02d9
      Yan Zheng authored
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      when data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in the old code
      where the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically while walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

    • Btrfs: Add zlib compression support · c8b97818
      Chris Mason authored
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts.
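
      A userspace illustration of the "did compression help?" test, using
      zlib's compress2() (the kernel side uses the in-kernel zlib, and the
      inode flagging is only hinted at in the comment):

      #include <stdlib.h>
      #include <zlib.h>

      /* Return 1 if deflate actually shrinks the buffer. */
      static int worth_compressing(const unsigned char *buf, unsigned long len)
      {
              unsigned long out_len = compressBound(len);
              unsigned char *out = malloc(out_len);
              int smaller = 0;

              if (out && compress2(out, &out_len, buf, len,
                                   Z_DEFAULT_COMPRESSION) == Z_OK)
                      smaller = out_len < len;
              free(out);
              return smaller;     /* 0 => caller flags the file instead */
      }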
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c, etc.)
      are changed to record both an in-memory size and an on-disk size, as well
      as a flag for compression (see the sketch after this list).
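
      A sketch of that dual-size bookkeeping (field names approximate the
      extent_map of this era and are illustrative, not the exact structure):

      struct extent_map {
              u64 start;              /* file offset */
              u64 len;                /* uncompressed, in-memory length */
              u64 block_start;        /* disk byte of the extent */
              u64 block_len;          /* compressed length actually on disk */
              unsigned long flags;    /* e.g. EXTENT_FLAG_COMPRESSED */
      };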
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software-only limit; the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software-only limit
      and will be subject to tuning later.
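
      The two limits, as they might look in code (the macro names are
      hypothetical; only the values come from the text):

      #define MAX_COMPRESSED_EXTENT   (128 * 1024)    /* bounds one random read */
      #define MAX_UNCOMPRESSED_INPUT  (256 * 1024)    /* bounds ram per extent */

      /* one compressed extent never covers more than 256k of file data */
      cur_bytes = min(total_bytes, (u64)MAX_UNCOMPRESSED_INPUT);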
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the CPUs on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>

  3. 16 Oct, 2008 1 commit
  4. 10 Oct, 2008 2 commits
    • Btrfs: make tree_search_offset more flexible in its searching · 37d3cddd
      Josef Bacik authored
      Sometimes we end up freeing a reserved extent because we don't need it; however,
      this means that it's possible for transaction->last_alloc to point to the middle
      of a free area.
      
      When we search for free space in find_free_space we do a tree_search_offset
      with contains set to 0, because we want it to find the next best free area if
      no free area starts at the given offset.
      
      Unfortunately that currently means that if the offset we were given as a hint
      points to the middle of a free area, we won't find anything.  This is especially
      bad if we happened to last allocate from the big huge chunk of a newly formed
      block group, since we won't find anything and have to go back and search the
      long way around.
      
      This fixes the problem by making it so that we return the free space area
      regardless of the contains variable.  This made cache misses happen _a lot_
      less, and speeds things up considerably.
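
      A sketch of the new search behaviour (helper names are hypothetical):

      entry = first_entry_at_or_after(root, offset);
      prev = entry ? previous_entry(entry) : last_entry(root);
      if (prev && prev->offset + prev->bytes > offset)
              return prev;    /* the hint lands inside this free area */
      return entry;           /* otherwise the next best free area */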
      Signed-off-by: Josef Bacik <jbacik@redhat.com>

    • Btrfs: Don't call security_inode_mkdir during subvol creation · a3dddf3f
      Chris Mason authored
      Subvol creation already requires privs, and security_inode_mkdir isn't
      exported.  For now we don't need it.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>

  5. 09 Oct, 2008 18 commits
  6. 08 Oct, 2008 3 commits
  7. 07 Oct, 2008 5 commits
    • tcp: Fix tcp_hybla zero congestion window growth with small rho and large cwnd. · 9d2c27e1
      Daniele Lacamera authored
      Because of rounding, in certain conditions, i.e. when in congestion
      avoidance state rho is smaller than 1/128 of the current cwnd, TCP
      Hybla congestion control starves and the cwnd is kept constant
      forever.
      
      This patch forces an increment by one segment after #send_cwnd calls
      without increments (NewReno behavior).
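
      A sketch of the fallback, in the style of tcp_hybla's congestion
      avoidance path (a condensation of the described fix, not the diff):

      /* rho2_7ls / snd_cwnd rounds to zero once rho < cwnd/128, so the
       * cwnd would otherwise never grow in congestion avoidance. */
      increment = ca->rho2_7ls / tp->snd_cwnd;
      if (increment < 128)
              tp->snd_cwnd_cnt++;                     /* count stalled calls */
      if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
              tp->snd_cwnd++;                         /* NewReno-style +1 segment */
              tp->snd_cwnd_cnt = 0;
      }
      tp->snd_cwnd += increment >> 7;                 /* normal Hybla growth */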
      Signed-off-by: Daniele Lacamera <root@danielinux.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>

    • net: Fix netdev_run_todo dead-lock · 58ec3b4d
      Herbert Xu authored
      Benjamin Thery tracked down a bug that explains many instances
      of the error
      
      unregister_netdevice: waiting for %s to become free. Usage count = %d
      
      It turns out that netdev_run_todo can dead-lock with itself if
      a second instance of it is run in a thread that will then free
      a reference to the device waited on by the first instance.
      
      The problem is really quite silly.  We were trying to create
      parallelism where none was required.  As netdev_run_todo always
      follows an RTNL section, and todo tasks can only be added
      with the RTNL held, by definition you should only need to wait
      for the very ones that you've added and be done with it.
      
      There is no need for a second mutex or spinlock.
      
      This is exactly what the following patch does.
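
      A condensed sketch of that simplification (close to the shape the
      description implies, not the literal diff):

      static LIST_HEAD(net_todo_list);        /* additions happen under RTNL */

      void netdev_run_todo(void)
      {
              struct list_head list;

              /* snapshot exactly the entries queued by the RTNL section
               * we just followed; nothing else can race in */
              list_replace_init(&net_todo_list, &list);
              __rtnl_unlock();

              while (!list_empty(&list)) {
                      struct net_device *dev =
                              list_entry(list.next, struct net_device, todo_list);
                      list_del(&dev->todo_list);
                      /* ... wait for the refcount to drop, then free ... */
              }
      }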
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>

    • tcp: Fix possible double-ack w/ user dma · 53240c20
      Ali Saidi authored
      From: Ali Saidi <saidi@engin.umich.edu>
      
      When TCP receive copy offload is enabled it's possible that
      tcp_rcv_established() will cause two acks to be sent for a single
      packet. In the case that a tcp_dma_early_copy() is successful,
      copied_early is set to true, which causes tcp_cleanup_rbuf() to be
      called early, which can send an ack. Further along in
      tcp_rcv_established(), __tcp_ack_snd_check() is called and will
      schedule a delayed ACK. If no packets are processed before the delayed
      ack timer expires, the packet will be acked twice.
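
      The guard this implies, roughly (condensed from the description):

      /* skip scheduling a delayed ACK when the early-copy path already
       * ran tcp_cleanup_rbuf() and may have acked this segment */
      if (!copied_early)
              __tcp_ack_snd_check(sk, 0);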
      Signed-off-by: David S. Miller <davem@davemloft.net>

    • net: only invoke dev->change_rx_flags when device is UP · b6c40d68
      Patrick McHardy authored
      Jesper Dangaard Brouer <hawk@comx.dk> reported a bug when setting a VLAN
      device down that is in promiscuous mode:
      
      When the VLAN device is set down, the promiscuous count on the real
      device is decremented by one by vlan_dev_stop(). When removing the
      promiscuous flag from the VLAN device afterwards, the promiscuous
      count on the real device is decremented a second time by the
      vlan_change_rx_flags() callback.
      
      The root cause for this is that the ->change_rx_flags() callback is
      invoked while the device is down. The synchronization is meant to mirror
      the behaviour of the ->set_rx_mode callbacks, meaning the ->open function
      is responsible for doing a full sync on open, the ->close() function is
      responsible for doing full cleanup on ->stop() and ->change_rx_flags()
      is meant to do incremental changes while the device is UP.
      
      Only invoke ->change_rx_flags() while the device is UP to provide the
      intended behaviour.
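
      A sketch of the guard, as a helper that all callers go through
      (close to the shape the description implies):

      static void dev_change_rx_flags(struct net_device *dev, int flags)
      {
              if ((dev->flags & IFF_UP) && dev->change_rx_flags)
                      dev->change_rx_flags(dev, flags);
      }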
      Tested-by: Jesper Dangaard Brouer <jdb@comx.dk>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>