1. 12 Aug, 2013 30 commits
  2. 30 Jul, 2013 1 commit
  3. 24 Jul, 2013 1 commit
    • xfs: di_flushiter considered harmful · e60896d8
      Dave Chinner authored
      When we made all inode updates transactional, we no longer needed
      the log recovery detection for inodes being newer on disk than the
      transaction being replayed, because replay of the log would always
      result in the latest version of the inode being on disk. The
      detection was redundant, but it was left in place because it wasn't
      considered to be a problem.
      
      However, with the new "don't read inodes on create" optimisation,
      flushiter has come back to bite us. Essentially, the optimisation
      always initialises flushiter to zero in the create transaction,
      and so if we then crash and run recovery and the inode already on
      disk has a non-zero flushiter, recovery of that inode is skipped.
      As a result, log recovery does the wrong thing and we end up with a
      corrupt filesystem.
      
      Because we have to support old kernel to new kernel upgrades, we
      can't just get rid of the flushiter support in log recovery as we
      might be upgrading from a kernel that doesn't have fully transactional
      inode updates.  Unfortunately, for v4 superblocks there is no way to
      guarantee that log recovery knows about this fact.
      
      We cannot add a new inode format flag to say it's a "special inode
      create" because it won't be understood by older kernels and so
      recovery could do the wrong thing on downgrade. We cannot specially
      detect the combination of zero mode/non-zero flushiter on disk to
      non-zero mode, zero flushiter in the log item during recovery
      because wrapping of the flushiter can result in false detection.
      
      Hence that makes this "don't use flushiter" optimisation limited to
      a disk format that guarantees that we don't need it. And that means
      the only fix here is to limit the "no read IO on create"
      optimisation to version 5 superblocks....
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
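The failure mode described in e60896d8 can be sketched with a small model of the recovery check. This is a simplified illustration under stated assumptions, not the kernel code: `model_skip_replay`, `model_can_skip_read_on_create`, `MODEL_MAX_FLUSH`, and the 16-bit counter width are names and values invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed maximum of the 16-bit flush counter in this model. */
#define MODEL_MAX_FLUSH 0xffff

/*
 * Hypothetical model of the flushiter test in log recovery: return true
 * when replay of the logged inode would be skipped because the on-disk
 * copy looks newer than the copy in the log.
 */
static bool model_skip_replay(uint16_t log_flushiter, uint16_t disk_flushiter)
{
	if (log_flushiter >= disk_flushiter)
		return false;	/* log copy is at least as new: replay it */

	/* Treat "disk at max, log small" as a wrapped counter: replay. */
	if (disk_flushiter == MODEL_MAX_FLUSH &&
	    log_flushiter < (MODEL_MAX_FLUSH >> 1))
		return false;

	return true;		/* on-disk inode considered newer: skip */
}

/*
 * Hypothetical gate for the fix: only a format that never relies on
 * flushiter (version 5 superblocks in this model) may skip the inode
 * read on create.
 */
static bool model_can_skip_read_on_create(int sb_version)
{
	return sb_version == 5;
}
```

In this model, a create transaction that logs flushiter 0 over a stale on-disk inode whose flushiter is 5 gives `model_skip_replay(0, 5) == true`, which is the skipped recovery the commit message describes.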
  4. 22 Jul, 2013 4 commits
    • xfs: Start using pquotaino from the superblock. · d892d586
      Chandra Seetharaman authored
      Start using pquotino and define a macro to check if the
      superblock has pquotino.
      
      Keep backward compatibility by allowing older superblocks with no
      separate pquota inode to be mounted.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: Initialize all quota inodes to be NULLFSINO · 01026297
      Chandra Seetharaman authored
      mkfs doesn't initialize the quota inodes to NULLFSINO as it does for the
      other internal inodes. As a result, two in-core values (0 and NULLFSINO)
      have to be checked to determine whether a quota inode is valid.
      
      Solve that problem by initializing the in-core values of all quotaino
      fields to NULLFSINO if they are 0 on disk.
      
      Note that these values are not written back to the on-disk superblock unless
      some quota is enabled on the filesystem. Even in that case, sb_pquotino is
      written to disk only if the on-disk superblock supports pquotino.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
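The normalisation described in 01026297 might look roughly like the following. This is a minimal sketch, not the patch itself: the `model_` types and helpers are hypothetical, and the sentinel is assumed to be an all-ones inode number.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t model_ino_t;

/* Assumed sentinel for "no inode" in this model (all bits set). */
#define MODEL_NULLFSINO ((model_ino_t)-1)

/* Hypothetical in-core copy of the three quota inode fields. */
struct model_sb {
	model_ino_t uquotino;
	model_ino_t gquotino;
	model_ino_t pquotino;
};

/* Map an on-disk 0 to the in-core sentinel; leave other values alone. */
static model_ino_t model_quotino_from_disk(model_ino_t disk_ino)
{
	return disk_ino == 0 ? MODEL_NULLFSINO : disk_ino;
}

/* Normalise all quota inode fields after reading the superblock. */
static void model_sb_quota_from_disk(struct model_sb *sb)
{
	sb->uquotino = model_quotino_from_disk(sb->uquotino);
	sb->gquotino = model_quotino_from_disk(sb->gquotino);
	sb->pquotino = model_quotino_from_disk(sb->pquotino);
}

/* After normalisation a single comparison decides validity. */
static bool model_quotino_valid(model_ino_t ino)
{
	return ino != MODEL_NULLFSINO;
}
```

Once every field has passed through `model_sb_quota_from_disk()`, callers in this model only compare against one sentinel instead of checking for both 0 and NULLFSINO.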
    • xfs: Fix a deadlock in xfs_log_commit_cil() code path · 297aa637
      Chandra Seetharaman authored
      While testing and rearranging pquota/gquota code, I stumbled
      on an xfs_shutdown() during a mount, and the mount just hung.
      
      Debugging showed a deadlock involving
      &log->l_cilp->xc_ctx_lock.
      
      It is in a code path where &log->l_cilp->xc_ctx_lock is first
      acquired in read mode and, some levels down, the same semaphore
      is acquired in write mode, causing a deadlock.
      
      This is the stack:
      xfs_log_commit_cil -> acquires &log->l_cilp->xc_ctx_lock in read mode
        xlog_print_tic_res
          xfs_force_shutdown
            xfs_log_force_umount
              xlog_cil_force
                xlog_cil_force_lsn
                  xlog_cil_push_foreground
                    xlog_cil_push - tries to acquire same semaphore in write mode
      
      This patch fixes the deadlock by changing the reason code for
      xfs_force_shutdown in xlog_print_tic_res() to SHUTDOWN_LOG_IO_ERROR.
      
      SHUTDOWN_LOG_IO_ERROR is the right reason code to be set since
      we are in the log path.
      
      Thanks to Dave for suggesting this solution.
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
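The read-then-write nesting reported in 297aa637 can be illustrated with a toy reader/writer lock. This is only a sketch: `struct model_rwsem` and its helpers are hypothetical, they use non-blocking "try" operations so the example terminates, and they are not the kernel's rw_semaphore API.

```c
#include <stdbool.h>

/* Toy reader/writer lock state; not the kernel's rw_semaphore. */
struct model_rwsem {
	int readers;	/* number of current read holders */
	bool writer;	/* true while a writer holds the lock */
};

static bool model_try_read(struct model_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static bool model_try_write(struct model_rwsem *sem)
{
	/* A writer needs exclusive access: no readers and no writer. */
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void model_up_read(struct model_rwsem *sem)
{
	sem->readers--;
}

/*
 * Mimic the reported call chain: take the lock in read mode, then try to
 * take the same lock in write mode further down the stack. Returns true
 * only if the nested write acquisition could succeed.
 */
static bool model_commit_then_push(struct model_rwsem *sem)
{
	bool pushed;

	if (!model_try_read(sem))
		return false;
	pushed = model_try_write(sem);	/* the push path wants write mode */
	if (pushed)
		sem->writer = false;
	model_up_read(sem);
	return pushed;
}
```

In this model the nested write attempt always fails while the caller's own read hold exists; a blocking write acquisition in the same position would wait forever, which matches the hang in the commit message.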
    • xfs: fix assertion failure in xfs_vm_write_failed() · 58e59854
      Jie Liu authored
      In xfs_vm_write_failed(), we evaluate the block_offset of pos with
      PAGE_MASK which is an unsigned long.  That is fine on 64-bit platforms
      regardless of whether the request pos is 32-bit or 64-bit.  However, on
      32-bit platforms the value is 0xfffff000 and so the high 32 bits in it
      will be masked off with (pos & PAGE_MASK) for a 64-bit pos.
      
      As a result, the evaluated block_offset is incorrect, which causes
      ASSERT(block_offset + from == pos) to fail and potentially passes
      the wrong block to xfs_vm_kill_delalloc_range().
      
      In this case, we can get a kernel panic if CONFIG_XFS_DEBUG is enabled:
      
      XFS: Assertion failed: block_offset + from == pos, file: fs/xfs/xfs_aops.c, line: 1504
      
      ------------[ cut here ]------------
       kernel BUG at fs/xfs/xfs_message.c:100!
       invalid opcode: 0000 [#1] SMP
       ........
       Pid: 4057, comm: mkfs.xfs Tainted: G           O 3.9.0-rc2 #1
       EIP: 0060:[<f94a7e8b>] EFLAGS: 00010282 CPU: 0
       EIP is at assfail+0x2b/0x30 [xfs]
       EAX: 00000056 EBX: f6ef28a0 ECX: 00000007 EDX: f57d22a4
       ESI: 1c2fb000 EDI: 00000000 EBP: ea6b5d30 ESP: ea6b5d1c
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 094f3ff4 CR3: 2bcb4000 CR4: 000006f0
       DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
       DR6: ffff0ff0 DR7: 00000400
       Process mkfs.xfs (pid: 4057, ti=ea6b4000 task=ea5799e0 task.ti=ea6b4000)
       Stack:
       00000000 f9525c48 f951fa80 f951f96b 000005e4 ea6b5d7c f9494b34 c19b0ea2
       00000066 f3d6c620 c19b0ea2 00000000 e9a91458 00001000 00000000 00000000
       00000000 c15c7e89 00000000 1c2fb000 00000000 00000000 1c2fb000 00000080
       Call Trace:
       [<f9494b34>] xfs_vm_write_failed+0x74/0x1b0 [xfs]
       [<c15c7e89>] ? printk+0x4d/0x4f
       [<f9494d7d>] xfs_vm_write_begin+0x10d/0x170 [xfs]
       [<c110a34c>] generic_file_buffered_write+0xdc/0x210
       [<f949b669>] xfs_file_buffered_aio_write+0xf9/0x190 [xfs]
       [<f949b7f3>] xfs_file_aio_write+0xf3/0x160 [xfs]
       [<c115e504>] do_sync_write+0x94/0xd0
       [<c115ed1f>] vfs_write+0x8f/0x160
       [<c115e470>] ? wait_on_retry_sync_kiocb+0x50/0x50
       [<c115f017>] sys_write+0x47/0x80
       [<c15d860d>] sysenter_do_call+0x12/0x28
       .............
       EIP: [<f94a7e8b>] assfail+0x2b/0x30 [xfs] SS:ESP 0068:ea6b5d1c
       ---[ end trace cdd9af4f4ecab42f ]---
       Kernel panic - not syncing: Fatal exception
      
      In order to avoid this, we can evaluate the block_offset of the start
      of the page by using shifts rather than masks, which avoids the
      mismatch problem.
      
      Thanks to Dave Chinner for help finding and fixing this bug.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Reviewed-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
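The truncation described in 58e59854 can be reproduced in user space by forcing the mask to 32 bits. This is a sketch under assumptions: the `model_` helpers are hypothetical, and a 4 KiB page size (shift of 12) is assumed.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Page-start offset computed with a mask that is only 32 bits wide. */
static int64_t model_block_offset_mask32(int64_t pos)
{
	/* 0xfffff000, as on a platform where unsigned long is 32 bits. */
	uint32_t page_mask = ~((UINT32_C(1) << MODEL_PAGE_SHIFT) - 1);

	/* The mask is zero-extended, so the high 32 bits of pos are cleared. */
	return pos & (int64_t)page_mask;
}

/* Page-start offset computed with shifts on the 64-bit value itself. */
static int64_t model_block_offset_shift(int64_t pos)
{
	return (pos >> MODEL_PAGE_SHIFT) << MODEL_PAGE_SHIFT;
}
```

For a file offset just past 4 GiB, such as 0x100001234, the masked variant yields 0x1000 while the shifted variant yields 0x100001000. Below 4 GiB the two agree, which is why the problem only shows up for large offsets on 32-bit platforms.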
  5. 14 Jul, 2013 4 commits
    • Linux 3.11-rc1 · ad81f054
      Linus Torvalds authored
    • Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux · 54be8200
      Linus Torvalds authored
      Pull slab update from Pekka Enberg:
       "Highlights:
      
        - Fix for boot-time problems on some architectures due to
          init_lock_keys() not respecting kmalloc_caches boundaries
          (Christoph Lameter)
      
        - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)
      
        - Fix for excessive slab freelist draining (Wanpeng Li)
      
        - SLUB and SLOB cleanups and fixes (various people)"
      
      I ended up editing the branch, and this avoids two commits at the end
      that were immediately reverted, and I instead just applied the oneliner
      fix in between myself.
      
      * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
        slub: Check for page NULL before doing the node_match check
        mm/slab: Give s_next and s_stop slab-specific names
        slob: Check for NULL pointer before calling ctor()
        slub: Make cpu partial slab support configurable
        slab: add kmalloc() to kernel API documentation
        slab: fix init_lock_keys
        slob: use DIV_ROUND_UP where possible
        slub: do not put a slab to cpu partial list when cpu_partial is 0
        mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
        mm/slub: Drop unnecessary nr_partials
        mm/slab: Fix /proc/slabinfo unwriteable for slab
        mm/slab: Sharing s_next and s_stop between slab and slub
        mm/slab: Fix drain freelist excessively
        slob: Rework #ifdeffery in slab.h
        mm, slab: moved kmem_cache_alloc_node comment to correct place
    • slub: Check for page NULL before doing the node_match check · c25f195e
      Steven Rostedt authored
      In the -rt kernel (mrg), we hit the following dump:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
      PGD a2d39067 PUD b1641067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
      CPU 3
      Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
      RIP: 0010:[<ffffffff811573f1>]  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
      RSP: 0018:ffff8800a9b17d70  EFLAGS: 00010213
      RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
      RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
      RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
      R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
      R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
      FS:  00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
      Stack:
       ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
       0000000001200011 0000000001200011 0000000000000000 0000000000000000
       00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
      Call Trace:
       [<ffffffff81202e08>] ? current_has_perm+0x68/0x80
       [<ffffffff81041cbd>] copy_process+0xdd/0x15b0
       [<ffffffff810a2125>] ? rt_up_read+0x25/0x30
       [<ffffffff8104369a>] do_fork+0x5a/0x360
       [<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220
       [<ffffffff8100b068>] sys_clone+0x28/0x30
       [<ffffffff81527423>] stub_clone+0x13/0x20
       [<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b
      Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
      RIP  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
       RSP <ffff8800a9b17d70>
      CR2: 0000000000000000
      ---[ end trace 0000000000000002 ]---
      
      Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel
      with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do
      disable migration. But the SLUB code is relatively lockless, and the
      spin_locks there are raw_spin_locks (not converted to mutexes), thus I
      believe this bug can happen in mainline without -rt features. The -rt
      patch is just good at triggering mainline bugs ;-)
      
      Anyway, looking at where this crashed, it seems that the page variable
      can be NULL when passed to the node_match() function (which does not
      check if it is NULL). When this happens we get the above panic.
      
      As page is only used in slab_alloc() to check if the node matches, if
      it's NULL I'm assuming that we can say it doesn't and call the
      __slab_alloc() code. Is this a correct assumption?
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
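The guard proposed in c25f195e can be sketched as follows. This is a simplified model rather than the SLUB source: `struct model_page`, `MODEL_NO_NODE`, and both helper functions are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_NODE (-1)	/* assumed "any node" marker */

/* Hypothetical stand-in for the slab page; only the node id matters here. */
struct model_page {
	int nid;
};

/* Like the function in the report, this dereferences page unchecked. */
static bool model_node_match(const struct model_page *page, int node)
{
	return node == MODEL_NO_NODE || page->nid == node;
}

/*
 * Decide whether the allocation must fall back to the slow path. The
 * !page test runs before model_node_match(), so a NULL page is treated
 * as "node does not match" instead of being dereferenced.
 */
static bool model_needs_slow_path(const void *object,
				  const struct model_page *page, int node)
{
	return !object || !page || !model_node_match(page, node);
}
```

Because `||` short-circuits, `model_node_match()` is never called with a NULL page in this model; a NULL page simply routes the request to the slow path, which is the assumption the commit message asks about.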
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 41d9884c
      Linus Torvalds authored
      Pull more vfs stuff from Al Viro:
       "O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including
        making simple_lookup() usable for filesystems with non-NULL s_d_op,
        which allows us to get rid of quite a bit of ugliness"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        sunrpc: now we can just set ->s_d_op
        cgroup: we can use simple_lookup() now
        efivarfs: we can use simple_lookup() now
        make simple_lookup() usable for filesystems that set ->s_d_op
        configfs: don't open-code d_alloc_name()
        __rpc_lookup_create_exclusive: pass string instead of qstr
        rpc_create_*_dir: don't bother with qstr
        llist: llist_add() can use llist_add_batch()
        llist: fix/simplify llist_add() and llist_add_batch()
        fput: turn "list_head delayed_fput_list" into llist_head
        fs/file_table.c:fput(): add comment
        Safer ABI for O_TMPFILE