1. 13 Apr, 2010 2 commits
  2. 09 Apr, 2010 4 commits
    • blkio: Add more debug-only per-cgroup stats · 812df48d
      Divyesh Shah authored
      1) group_wait_time - This is the amount of time the cgroup had to wait to get a
        timeslice for one of its queues from when it became busy, i.e., went from 0
        to 1 request queued. This is different from the io_wait_time which is the
        cumulative total of the amount of time spent by each IO in that cgroup waiting
        in the scheduler queue. This stat is a great way to find out any jobs in the
        fleet that are being starved or waiting for longer than what is expected (due
        to an IO controller bug or any other issue).
      2) empty_time - This is the amount of time a cgroup spends w/o any pending
         requests. This stat is useful when a job does not seem to be able to use its
         assigned disk share by helping check if that is happening due to an IO
         controller bug or because the job is not submitting enough IOs.
      3) idle_time - This is the amount of time spent by the IO scheduler idling
         for a given cgroup in anticipation of a better request than the existing ones
         from other queues/cgroups.
      
      All these stats are recorded using start and stop events. When reading these
      stats, we do not add the delta between the current time and the last start time
      if we're between the start and stop events. We avoid doing this to make sure
      that these numbers are always monotonically increasing when read. Since we're
      using sched_clock() which may use the tsc as its source, it may induce some
      inconsistency (due to tsc resync across cpus) if we included the current delta.
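
The start/stop accounting described above can be sketched roughly as follows. This is a minimal illustration, not the blk-cgroup code: `struct interval_stat` and the three helpers are hypothetical names, and timestamps are passed in by the caller rather than taken from sched_clock().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one start/stop-driven stat such as empty_time. */
struct interval_stat {
    uint64_t total_ns;  /* sum of all completed start..stop intervals */
    uint64_t start_ns;  /* timestamp of the most recent start event */
    bool     running;   /* true between a start event and its stop event */
};

static void stat_start(struct interval_stat *s, uint64_t now_ns)
{
    if (s->running)
        return;             /* ignore a duplicate start */
    s->start_ns = now_ns;
    s->running = true;
}

static void stat_stop(struct interval_stat *s, uint64_t now_ns)
{
    if (!s->running)
        return;
    /* Fold the interval in only if the clock did not appear to go backwards. */
    if (now_ns > s->start_ns)
        s->total_ns += now_ns - s->start_ns;
    s->running = false;
}

/* A read reports completed intervals only; the in-flight delta is left out. */
static uint64_t stat_read(const struct interval_stat *s)
{
    return s->total_ns;
}
```

Because a read never adds "now minus start", two consecutive reads cannot go down even if the clock source is briefly inconsistent across CPUs; the cost is that an interval in progress is invisible until its stop event.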
      
      Signed-off-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      812df48d
    • blkio: Add io_queued and avg_queue_size stats · cdc1184c
      Divyesh Shah authored
      These stats are useful for getting a feel for the queue depth of the cgroup,
      i.e., how filled up its queues are at a given instant and over the existence of
      the cgroup. This ability is useful when debugging problems in the wild as it
      helps understand the application's IO pattern w/o having to read through the
      userspace code (because it's tedious or just not available) or w/o the ability
      to run blktrace (since you may not have root access and/or not want to disturb
      performance).
      
      Signed-off-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cdc1184c
    • blkio: Add io_merged stat · 812d4026
      Divyesh Shah authored
      This includes both the number of bios merged into requests belonging to this
      cgroup as well as the number of requests merged together.
      In the past, we've observed different merging behavior across upstream kernels,
      some by design, some actual bugs. This stat helps a lot in debugging such
      problems when applications report decreased throughput with a new kernel
      version.
      
      This needed adding an extra elevator function to capture bios being merged as I
      did not want to pollute elevator code with blkiocg knowledge and hence needed
      the accounting invocation to come from CFQ.
      
      Signed-off-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      812d4026
    • blkio: Changes to IO controller additional stats patches · 84c124da
      Divyesh Shah authored
      that include some minor fixes and addresses all comments.
      
      Changelog: (most based on Vivek Goyal's comments)
      o renamed blkiocg_reset_write to blkiocg_reset_stats
      o more clarification in the documentation on io_service_time and io_wait_time
      o Initialize blkg->stats_lock
      o rename io_add_stat to blkio_add_stat and declare it static
      o use bool for direction and sync
      o derive direction and sync info from existing rq methods
      o use 12 for major:minor string length
      o define io_service_time better to cover the NCQ case
      o add a separate reset_stats interface
      o make the indexed stats a 2d array to simplify macro and function pointer code
      o blkio.time now exports in jiffies as before
      o Added stats description in patch description and
        Documentation/cgroup/blkio-controller.txt
      o Prefix all stats functions with blkio and make them static as applicable
      o replace IO_TYPE_MAX with IO_TYPE_TOTAL
      o Moved #define constant to top of blk-cgroup.c
      o Pass dev_t around instead of char *
      o Add note to documentation file about resetting stats
      o use BLK_CGROUP_MODULE in addition to BLK_CGROUP config option in #ifdef
        statements
      o Avoid struct request specific knowledge in blk-cgroup. blk-cgroup.h now has
        rq_direction() and rq_sync() functions which are used by CFQ and when using
        io-controller at a higher level, bio_* functions can be added.
      
      Signed-off-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      84c124da
  3. 06 Apr, 2010 1 commit
    • laptop-mode: Make flushes per-device · 31373d09
      Matthew Garrett authored
      One of the features of laptop-mode is that it forces a writeout of dirty
      pages if something else triggers a physical read or write from a device.
      The current implementation flushes pages on all devices, rather than only
      the one that triggered the flush. This patch alters the behaviour so that
      only the recently accessed block device is flushed, preventing other
      disks being spun up for no terribly good reason.
      Signed-off-by: Matthew Garrett <mjg@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      31373d09
  4. 02 Apr, 2010 7 commits
  5. 30 Mar, 2010 6 commits
  6. 29 Mar, 2010 20 commits
    • Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 · 9623e5a2
      Linus Torvalds authored
      * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
        ocfs2: Fix a race in o2dlm lockres mastery
        Ocfs2: Handle deletion of reflinked oprhan inodes correctly.
        Ocfs2: Journaling i_flags and i_orphaned_slot when adding inode to orphan dir.
        ocfs2: Clear undo bits when local alloc is freed
        ocfs2: Init meta_ac properly in ocfs2_create_empty_xattr_block.
        ocfs2: Fix the update of name_offset when removing xattrs
        ocfs2: Always try for maximum bits with new local alloc windows
        ocfs2: set i_mode on disk during acl operations
        ocfs2: Update i_blocks in reflink operations.
        ocfs2: Change bg_chain check for ocfs2_validate_gd_parent.
        [PATCH] Skip check for mandatory locks when unlocking
      9623e5a2
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · 9f321603
      Linus Torvalds authored
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (28 commits)
        ceph: update discussion list address in MAINTAINERS
        ceph: some documentations fixes
        ceph: fix use after free on mds __unregister_request
        ceph: avoid loaded term 'OSD' in documention
        ceph: fix possible double-free of mds request reference
        ceph: fix session check on mds reply
        ceph: handle kmalloc() failure
        ceph: propagate mds session allocation failures to caller
        ceph: make write_begin wait propagate ERESTARTSYS
        ceph: fix snap rebuild condition
        ceph: avoid reopening osd connections when address hasn't changed
        ceph: rename r_sent_stamp r_stamp
        ceph: fix connection fault con_work reentrancy problem
        ceph: prevent dup stale messages to console for restarting mds
        ceph: fix pg pool decoding from incremental osdmap update
        ceph: fix mds sync() race with completing requests
        ceph: only release unused caps with mds requests
        ceph: clean up handle_cap_grant, handle_caps wrt session mutex
        ceph: fix session locking in handle_caps, ceph_check_caps
        ceph: drop unnecessary WARN_ON in caps migration
        ...
      9f321603
    • Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging · 9d54e2c0
      Linus Torvalds authored
      * 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
        hwmon: (asc7621) Add X58 entry in Kconfig
        hwmon: (w83793) Saving negative errors in unsigned
        hwmon: (coretemp) Add missing newline to dev_warn() message
        hwmon: (coretemp) Fix cpu model output
      9d54e2c0
    • Linus Torvalds's avatar
      7b128872
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · 6631424f
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (33 commits)
        r8169: offical fix for CVE-2009-4537 (overlength frame DMAs)
        ipv6: Don't drop cache route entry unless timer actually expired.
        tulip: Add missing parens.
        r8169: fix broken register writes
        pcnet_cs: add new id
        bonding: fix broken multicast with round-robin mode
        drivers/net: Fix continuation lines
        e1000: do not modify tx_queue_len on link speed change
        net: ipmr/ip6mr: prevent out-of-bounds vif_table access
        ixgbe: Do not run all Diagnostic offline tests when VFs are active
        igb: use correct bits to identify if managability is enabled
        benet: Fix compile warnnings in drivers/net/benet/be_ethtool.c
        net: Add MSG_WAITFORONE flag to recvmmsg
        e1000e: do not modify tx_queue_len on link speed change
        igbvf: do not modify tx_queue_len on link speed change
        ipv4: Restart rt_intern_hash after emergency rebuild (v2)
        ipv4: Cleanup struct net dereference in rt_intern_hash
        net: fix netlink address dumping in IPv4/IPv6
        tulip: Fix null dereference in uli526x_rx_packet()
        gianfar: fix undo of reserve()
        ...
      6631424f
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6 · c45140a9
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
        sparc64: Properly truncate pt_regs framepointer in perf callback.
        arch/sparc/kernel: Use set_cpus_allowed_ptr
        sparc: Fix use of uid16_t and gid16_t in asm/stat.h
      c45140a9
    • ext3: fix broken handling of EXT3_STATE_NEW · de329820
      Linus Torvalds authored
      In commit 9df93939 ("ext3: Use bitops to read/modify
      EXT3_I(inode)->i_state") ext3 changed its internal 'i_state' variable to
      use bitops for its state handling.  However, unlike the same ext4
      change, it didn't actually change the name of the field when it changed
      the semantics of it.
      
      As a result, an old use of 'i_state' remained in fs/ext3/ialloc.c that
      initialized the field to EXT3_STATE_NEW.  And that does not work
      _at_all_ when we're now working with individually named bits rather than
      values that get masked.  So the code tried to mark the state to be new,
      but in actual fact set the field to EXT3_STATE_JDATA.  Which makes no
      sense at all, and screws up all the code that checks whether the inode
      was newly allocated.
      
      In particular, it made the xattr code unhappy, and caused various random
      behavior, like apparently
      
      	https://bugzilla.redhat.com/show_bug.cgi?id=577911
      
      So fix the initialization, and rename the field to match ext4 so that we
      don't have this happen again.
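
The failure mode described above, a bit number assigned as if it were a mask value, can be reproduced in a few lines. This is a sketch under assumptions: the `STATE_*` names, their ordering, and the helper functions are illustrative and are not the ext3 definitions; they are chosen only so that the "new" flag is bit 1 and the "jdata" flag is bit 0, matching the behaviour the message reports.

```c
#include <stdbool.h>

/* Bit numbers, not masks. Names and ordering are illustrative assumptions. */
enum { STATE_JDATA = 0, STATE_NEW = 1, STATE_XATTR = 2 };

/* Test one flag by its bit number. */
static bool state_test(unsigned long flags, int bit)
{
    return (flags >> bit) & 1UL;
}

/* Return flags with the bit named by 'bit' set. */
static unsigned long state_set(unsigned long flags, int bit)
{
    return flags | (1UL << bit);
}

/* The leftover pattern: the bit number is stored as though it were a mask. */
static unsigned long init_state_buggy(void)
{
    return STATE_NEW;   /* the value 1 has only bit 0 set, i.e. STATE_JDATA */
}

/* The intended initialization: set the bit that STATE_NEW names. */
static unsigned long init_state_fixed(void)
{
    return state_set(0, STATE_NEW);
}
```

With these definitions the buggy initializer yields a word in which the JDATA bit tests true and the NEW bit tests false. Renaming the field when its semantics change turns any such leftover assignment into a compile error instead of a silent misbehaviour.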
      
      Cc: James Morris <jmorris@namei.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Daniel J Walsh <dwalsh@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de329820
    • r8169: offical fix for CVE-2009-4537 (overlength frame DMAs) · c0cd884a
      Neil Horman authored
      Official patch to fix the r8169 frame length check error.
      
      Based on this initial thread:
      http://marc.info/?l=linux-netdev&m=126202972828626&w=1
      This is the official patch to fix the frame length problems in the r8169
      driver.  As noted in the previous thread, while this patch incurs a performance
      hit on the driver, it's possible to improve performance dynamically by updating
      the mtu and rx_copybreak values at runtime to return performance to what it was
      for those NICs which are unaffected by the idiosyncrasy (if there are any).
      
      Summary:
      
          A while back Eric submitted a patch for r8169 in which the proper
      allocated frame size was written to RXMaxSize to prevent the NIC from DMAing too
      much data.  This was done in commit fdd7b4c3.  A
      long time prior to that however, Francois posted
      126fa4b9, which explicitly disabled the MaxSize
      setting due to the fact that the hardware behaved in odd ways when overlong
      frames were received on NICs supported by this driver.  This was mentioned in a
      security conference recently:
      http://events.ccc.de/congress/2009/Fahrplan//events/3596.en.html
      
      It seems that if we can't enable frame size filtering, then, as Eric correctly
      noticed, we can find ourselves DMA-ing too much data to a buffer, causing
      corruption.  As a result it seems that we are forced to allocate a frame which
      is ready to handle a maximally sized receive.
      
      This obviously has performance issues with it, so to mitigate that issue, this
      patch does two things:
      
      1) Raises the copybreak value to the frame allocation size, which should force
      appropriately sized packets to get allocated on rx, rather than a full new 16k
      buffer.
      
      2) This patch only disables frame filtering initially (i.e., during the NIC
      open); changing the MTU results in ring buffer allocation of a size in relation
      to the new mtu (along with a warning indicating that this is dangerous).
      
      Because of item (2), individuals who can't cope with the performance hit (or can
      otherwise filter frames to prevent the bug), or who have hardware they are sure
      is unaffected by this issue, can manually lower the copybreak and reset the mtu
      such that performance is restored easily.
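
The copybreak trade-off in item (1) can be illustrated with a small decision function. This is a simplified sketch, not the r8169 receive path: `rx_alloc_size` and `SKETCH_RX_BUF_SIZE` are hypothetical, and the 16383-byte figure is only an assumption standing in for the "full 16k buffer" mentioned above.

```c
#include <stddef.h>

/* Assumption: size of a ring buffer able to hold a maximally sized receive. */
#define SKETCH_RX_BUF_SIZE 16383

/*
 * Size of the new allocation needed for one received frame.
 * Below the copybreak threshold the frame is copied into a right-sized
 * buffer and the large ring buffer stays in place; at or above it the
 * ring buffer is handed up the stack and a full-sized replacement is needed.
 */
static size_t rx_alloc_size(size_t pkt_len, size_t copybreak)
{
    if (pkt_len < copybreak)
        return pkt_len;
    return SKETCH_RX_BUF_SIZE;
}
```

With the threshold raised to the ring buffer size, a 1500-byte frame costs a 1500-byte allocation plus a copy; with a small threshold such as 256, the same frame would cost a full-sized replacement buffer.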
      Signed-off-by: Neil Horman <nhorman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0cd884a
    • sparc64: Properly truncate pt_regs framepointer in perf callback. · 9e8307ec
      David S. Miller authored
      For 32-bit processes, we save the full 64-bits of the regs in pt_regs.
      
      But unlike when the userspace actually does load and store
      instructions, the top 32-bits don't get automatically truncated by the
      cpu in kernel mode (because the kernel doesn't execute with PSTATE_AM
      address masking enabled).
      
      So we have to do it by hand.
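
Doing the truncation "by hand" amounts to masking off the upper half of the saved register before using it as a user address. A minimal sketch, assuming a hypothetical helper `user_frame_pointer` and a caller that already knows whether the task is 32-bit:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the frame pointer to use when walking a user stack.
 * For a 32-bit task only the low 32 bits of the saved 64-bit register
 * are meaningful, so the upper bits are cleared explicitly.
 */
static uint64_t user_frame_pointer(uint64_t saved_fp, bool is_32bit_task)
{
    if (is_32bit_task)
        saved_fp &= 0xffffffffULL;
    return saved_fp;
}
```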
      Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e8307ec
    • hwmon: (asc7621) Add X58 entry in Kconfig · b00d8a7e
      Jaswinder Singh Rajput authored
      Intel X58 has an asc7621a chip, so add an X58 entry in Kconfig for asc7621.
      Also arrange the existing models in ascending order.
      Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      b00d8a7e
    • hwmon: (w83793) Saving negative errors in unsigned · 3f7cd7ea
      Dan Carpenter authored
      "ret" is used to store the return value for watchdog_trigger() and it
      should be signed for the error handling to work.
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Acked-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      3f7cd7ea
    • hwmon: (coretemp) Add missing newline to dev_warn() message · 4d7a5644
      Dean Nelson authored
      Add missing newline to dev_warn() message string. This is more of an issue
      with older kernels that don't automatically add a newline if it was missing
      from the end of the previous line.
      Signed-off-by: Dean Nelson <dnelson@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      4d7a5644
    • hwmon: (coretemp) Fix cpu model output · fcc6a746
      Prarit Bhargava authored
      Avoid hex and decimal confusion when printing out the cpu model.
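
The kind of confusion meant here can be shown with two format strings. The message text and function names below are hypothetical, not the coretemp driver's actual strings; the sketch only illustrates why a bare `%x` is ambiguous when the digits happen to look decimal.

```c
#include <stdio.h>

/* A bare %x leaves the base to the reader's guess. */
static void format_model_bare(char *buf, size_t len, unsigned int model)
{
    snprintf(buf, len, "CPU model %x", model);
}

/* An explicit 0x prefix makes the base unambiguous. */
static void format_model_prefixed(char *buf, size_t len, unsigned int model)
{
    snprintf(buf, len, "CPU model 0x%x", model);
}
```

For model 0x17 (decimal 23) the first variant produces "CPU model 17", which a reader can easily take for decimal 17; the second produces "CPU model 0x17".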
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      fcc6a746
    • [LogFS] Erase new journal segments · 6be7fa06
      Joern Engel authored
      If the device contains an old logfs image and the journal is moved to
      segments that have never been used by the current logfs and not all
      journal segments are erased before the next mount, the old content can
      confuse the mount code.  To prevent this, always erase the new journal
      segments.
      Signed-off-by: Joern Engel <joern@logfs.org>
      6be7fa06
    • [LogFS] Move reserved segments with journal · 0943846a
      Joern Engel authored
      Fixes a GC livelock.
      Signed-off-by: Joern Engel <joern@logfs.org>
      0943846a
    • x86: Do not free zero sized per cpu areas · eed63519
      Ian Campbell authored
      This avoids an infinite loop in free_early_partial().
      
      Add a warning to free_early_partial() to catch future problems.
      
      -v5: put back start > end back into WARN_ONCE()
      -v6: use one line for warning, suggested by Linus
      -v7: more tests
      -v8: remove the function name as suggested by Johannes
           WARN_ONCE() will print out that function name.
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Tested-by: Joel Becker <joel.becker@oracle.com>
      Tested-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <1269830604-26214-4-git-send-email-yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      eed63519
    • x86: Make sure free_init_pages() frees pages on page boundary · c967da6a
      Yinghai Lu authored
      When CONFIG_NO_BOOTMEM=y, it could use memory more efficiently, or
      in a more compact fashion.
      
      Example:
      
       Allocated new RAMDISK: 00ec2000 - 0248ce57
       Move RAMDISK from 000000002ea04000 - 000000002ffcee56 to 00ec2000 - 0248ce56
      
      The new RAMDISK's end is not page aligned.
      Last page could be shared with other users.
      
      When free_init_pages() is called for initrd or .init, the page
      could be freed and we could corrupt other data.
      
      code segment in free_init_pages():
      
       |        for (; addr < end; addr += PAGE_SIZE) {
       |                ClearPageReserved(virt_to_page(addr));
       |                init_page_count(virt_to_page(addr));
       |                memset((void *)(addr & ~(PAGE_SIZE-1)),
       |                        POISON_FREE_INITMEM, PAGE_SIZE);
       |                free_page(addr);
       |                totalram_pages++;
       |        }
      
      The last half page could be used as one whole free page.
      
      So page align the boundaries.
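
The effect of an unaligned end address on a page-stepping loop like the one quoted above can be checked with the RAMDISK numbers from the example. This is a user-space sketch under assumptions: 4 KiB pages, and hypothetical helpers `page_align_down`, `page_align_up` and `pages_freed` in place of the kernel's macros and freeing logic. Rounding the start up and the end down is shown as one conservative choice so that no partially owned page is touched.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Round an address down to the start of its page. */
static uintptr_t page_align_down(uintptr_t addr)
{
    return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Round an address up to the next page boundary (no-op if already aligned). */
static uintptr_t page_align_up(uintptr_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Count the pages a "for (; addr < end; addr += PAGE_SIZE)" loop would free. */
static unsigned long pages_freed(uintptr_t begin, uintptr_t end)
{
    unsigned long n = 0;

    for (uintptr_t addr = begin; addr < end; addr += SKETCH_PAGE_SIZE)
        n++;
    return n;
}
```

For the range 00ec2000 - 0248ce57, the loop with the raw end address counts one page more than with the end rounded down to 0248c000. That extra page is the partially used last page, which may be shared with another user.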
      
      -v2: make the original initramdisk to be aligned, according to
           Johannes, otherwise we have the chance to lose one page.
           we still need to keep initrd_end not aligned, otherwise it could
           confuse decompressor.
      -v3: change to WARN_ON instead, suggested by Johannes.
      -v4: use PAGE_ALIGN, suggested by Johannes.
           We may fix that macro name later to PAGE_ALIGN_UP, and PAGE_ALIGN_DOWN
           Add comments about assuming ramdisk start is aligned
           in relocate_initrd(), change to re-get ramdisk_image instead of saving it
           to make the diff smaller. Add warning for wrong range, suggested by Johannes.
      -v6: remove one WARN()
           We need to align beginning in free_init_pages()
           do not copy more than ramdisk_size, noticed by Johannes
      Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Tested-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <1269830604-26214-3-git-send-email-yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c967da6a
    • Sage Weil's avatar
      82593f87
    • ceph: some documentations fixes · 8136b58d
      Cheng Renquan authored
      New documentation should have an entry in the 00-INDEX.  Correct git
      urls.
      Signed-off-by: Cheng Renquan <crquan@gmail.com>
      Signed-off-by: Sage Weil <sage@newdream.net>
      8136b58d
    • x86: Make smp_locks end with page alignment · 596b711e
      Yinghai Lu authored
      Fix:
      
       ------------[ cut here ]------------
       WARNING: at arch/x86/mm/init.c:342 free_init_pages+0x4c/0xfa()
       free_init_pages: range [0x40daf000, 0x40db5c24] is not aligned
       Modules linked in:
       Pid: 0, comm: swapper Not tainted 2.6.34-rc2-tip-03946-g4f16b23-dirty #50
       Call Trace:
        [<40232e9f>] warn_slowpath_common+0x65/0x7c
        [<4021c9f0>] ? free_init_pages+0x4c/0xfa
        [<40881434>] ? _etext+0x0/0x24
        [<40232eea>] warn_slowpath_fmt+0x24/0x27
        [<4021c9f0>] free_init_pages+0x4c/0xfa
        [<40881434>] ? _etext+0x0/0x24
        [<40d3f4bd>] alternative_instructions+0xf6/0x100
        [<40d3fe4f>] check_bugs+0xbd/0xbf
        [<40d398a7>] start_kernel+0x2d5/0x2e4
        [<40d390ce>] i386_start_kernel+0xce/0xd5
       ---[ end trace 4eaa2a86a8e2da22 ]---
      
      Comments in vmlinux.lds.S already said:
      
       |        /*
       |         * smp_locks might be freed after init
       |         * start/end must be page aligned
       |         */
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <1269830604-26214-2-git-send-email-yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      596b711e