1. 04 Dec, 2014 3 commits
    • Gu Zheng's avatar
      aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer · c60aa7e8
      Gu Zheng authored
      commit 835f252c upstream.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=86831
      
      Markus reported that when shutting down mysqld (with AIO support,
      on a ext3 formatted Harddrive) leads to a negative number of dirty pages
      (underrun to the counter). The negative number results in a drastic reduction
      of the write performance because the page cache is not used, because the kernel
      thinks it is still 2 ^ 32 dirty pages open.
      
      Add a warn trace in __dec_zone_state will catch this easily:
      
      static inline void __dec_zone_state(struct zone *zone, enum
      	zone_stat_item item)
      {
           atomic_long_dec(&zone->vm_stat[item]);
      +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
      	atomic_long_read(&zone->vm_stat[item]) < 0);
           atomic_long_dec(&vm_stat[item]);
      }
      
      [   21.341632] ------------[ cut here ]------------
      [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
      cancel_dirty_page+0x164/0x224()
      [   21.355296] Modules linked in: wutbox_cp sata_mv
      [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
      [   21.366793] Workqueue: events free_ioctx
      [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
      (show_stack+0x20/0x24)
      [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
      (dump_stack+0x24/0x28)
      [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
      (warn_slowpath_common+0x84/0x9c)
      [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
      (warn_slowpath_null+0x2c/0x34)
      [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
      (cancel_dirty_page+0x164/0x224)
      [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
      (truncate_inode_page+0x8c/0x158)
      [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
      (truncate_inode_pages_range+0x11c/0x53c)
      [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
      [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
      [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
      (truncate_setsize+0x5c/0x74)
      [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
      (put_aio_ring_file.isra.14+0x34/0x90)
      [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
      [<c013b424>] (aio_free_ring+0x20/0xcc)
      [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
      (free_ioctx+0x24/0x44)
      [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
      (process_one_work+0x134/0x47c)
      [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
      (worker_thread+0x130/0x414)
      [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
      (kthread+0xd4/0xec)
      [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
      (ret_from_fork+0x14/0x20)
      [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
      
      The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
      (bypasses the VFS dirty pages increment) when init, and aio fs uses
      *default_backing_dev_info* as the backing dev, which does not disable
      the dirty pages accounting capability.
      So truncating aio ring file will contribute to accounting dirty pages (VFS
      dirty pages decrement), then error occurs.
      
      The original goal is keeping these pages in memory (can not be reclaimed
      or swapped) in life-time via marking it dirty. But thinking more, we have
      already pinned pages via elevating the page's refcount, which can already
      achieve the goal, so the SetPageDirty seems unnecessary.
      
      In order to fix the issue, using the __set_page_dirty_no_writeback instead
      of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
      set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).
      
      With the above change, the dirty pages accounting can work well. But as we
      known, aio fs is an anonymous one, which should never cause any real write-back,
      we can ignore the dirty pages (write back) accounting by disabling the dirty
      pages (write back) accounting capability. So we introduce an aio private
      backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
      replace the default one.
      Reported-by: default avatarMarkus Königshaus <m.koenigshaus@wut.de>
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      c60aa7e8
    • Steven Capper's avatar
      ARM: 8109/1: mm: Modify pte_write and pmd_write logic for LPAE · 83daf276
      Steven Capper authored
      commit ded94779 upstream.
      
      For LPAE, we have the following means for encoding writable or dirty
      ptes:
                                    L_PTE_DIRTY       L_PTE_RDONLY
          !pte_dirty && !pte_write        0               1
          !pte_dirty && pte_write         0               1
          pte_dirty && !pte_write         1               1
          pte_dirty && pte_write          1               0
      
      So we can't distinguish between writeable clean ptes and read only
      ptes. This can cause problems with ptes being incorrectly flagged as
      read only when they are writeable but not dirty.
      
      This patch renumbers L_PTE_RDONLY from AP[2] to a software bit #58,
      and adds additional logic to set AP[2] whenever the pte is read only
      or not dirty. That way we can distinguish between clean writeable ptes
      and read only ptes.
      
      HugeTLB pages will use this new logic automatically.
      
      We need to add some logic to Transparent HugePages to ensure that they
      correctly interpret the revised pgprot permissions (L_PTE_RDONLY has
      moved and no longer matches PMD_SECT_AP2). In the process of revising
      THP, the names of the PMD software bits have been prefixed with L_ to
      make them easier to distinguish from their hardware bit counterparts.
      Signed-off-by: default avatarSteve Capper <steve.capper@linaro.org>
      Reviewed-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Cc: Hou Pengyang <houpengyang@huawei.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      83daf276
    • Steven Capper's avatar
      ARM: 8108/1: mm: Introduce {pte,pmd}_isset and {pte,pmd}_isclear · bffaa640
      Steven Capper authored
      commit f2950706 upstream.
      
      Long descriptors on ARM are 64 bits, and some pte functions such as
      pte_dirty return a bitwise-and of a flag with the pte value. If the
      flag to be tested resides in the upper 32 bits of the pte, then we run
      into the danger of the result being dropped if downcast.
      
      For example:
      	gather_stats(page, md, pte_dirty(*pte), 1);
      where pte_dirty(*pte) is downcast to an int.
      
      This patch introduces a new macro pte_isset which performs the bitwise
      and, then performs a double logical invert (where needed) to ensure
      predictable downcasting. The logical inverse pte_isclear is also
      introduced.
      
      Equivalent pmd functions for Transparent HugePages have also been
      added.
      Signed-off-by: default avatarSteve Capper <steve.capper@linaro.org>
      Reviewed-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Cc: Hou Pengyang <houpengyang@huawei.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      bffaa640
  2. 02 Dec, 2014 1 commit
  3. 01 Dec, 2014 36 commits