  1. 03 Dec, 2020 1 commit
    • mm: memcontrol: Use helpers to read page's memcg data · bcfe06bf
      Roman Gushchin authored
      
      Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
      
      Currently a non-slab kernel page which has been charged to a memory cgroup
      can't be mapped to userspace.  The underlying reason is simple: the
      PageKmemcg flag is defined as a page type (like buddy, offline, etc.), so
      it takes a bit from the page->_mapcount counter.  Pages with a page type
      set can't be mapped to userspace.
      
      But in general the kmemcg flag has nothing to do with mapping to
      userspace.  It only means that the page has been accounted by the page
      allocator, so it has to be properly uncharged on release.
      
      Some bpf maps map vmalloc-based memory to userspace, and their memory
      can't be accounted because of this implementation detail.
      
      This patchset removes this limitation by moving the PageKmemcg flag into
      one of the free bits of the page->mem_cgroup pointer.  It also formalizes
      accesses to page->mem_cgroup and page->obj_cgroups using new helpers,
      adds several checks, and removes a couple of obsolete functions.  As a
      result the code becomes more robust, with fewer open-coded bit tricks.
      
      This patch (of 4):
      
      Currently there are many open-coded reads of the page->mem_cgroup pointer,
      as well as a couple of read helpers which are barely used.

      This is an obstacle to reusing some bits of the pointer for storing
      additional information.  In fact, we already do this for slab pages,
      where the last bit indicates that the pointer has an attached vector of
      objcg pointers instead of a regular memcg pointer.
      
      This commit uses the two existing helpers and introduces a new one,
      converting all read sides to calls of these helpers:
        struct mem_cgroup *page_memcg(struct page *page);
        struct mem_cgroup *page_memcg_rcu(struct page *page);
        struct mem_cgroup *page_memcg_check(struct page *page);
      
      page_memcg_check() is intended to be used in cases when the page can be a
      slab page and have a memcg pointer pointing at an objcg vector.  It
      checks the lowest bit, and if it is set, returns NULL.  page_memcg()
      contains a VM_BUG_ON_PAGE() check that the page is not a slab page.
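
      For illustration (using the unsigned long memcg_data field introduced
      just below), the check can be sketched as follows; the exact masking in
      the final helper may differ:

        struct mem_cgroup *page_memcg_check(struct page *page)
        {
                unsigned long memcg_data = READ_ONCE(page->memcg_data);

                /* Lowest bit set: an objcg vector, not a memcg pointer. */
                if (memcg_data & 0x1UL)
                        return NULL;

                return (struct mem_cgroup *)memcg_data;
        }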
      
      To make sure nobody uses direct access, struct page's
      mem_cgroup/obj_cgroups fields are converted to a single unsigned long
      memcg_data field.
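
      In struct page, this amounts to replacing the typed pointers with one
      opaque word, roughly:

        /* before */
        union {
                struct mem_cgroup *mem_cgroup;
                struct obj_cgroup **obj_cgroups;
        };

        /* after */
        unsigned long memcg_data;
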
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
      Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
      Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
  2. 18 Oct, 2020 1 commit
    • mm, memcg: rework remote charging API to support nesting · b87d8cef
      Roman Gushchin authored
      Currently the remote memcg charging API consists of two functions:
      memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear a
      memcg value that overrides the memcg of the current task.
      
        memalloc_use_memcg(target_memcg);
        <...>
        memalloc_unuse_memcg();
      
      This works perfectly for allocations performed from a normal context;
      however, an attempt to call it from an interrupt context, or simply to
      nest two remote charging blocks, will lead to incorrect accounting.  On
      exit from the inner block the active memcg is cleared instead of being
      restored.
      
        memalloc_use_memcg(target_memcg);
      
        memalloc_use_memcg(target_memcg_2);
          <...>
          memalloc_unuse_memcg();
      
          Error: allocations here are charged to the memcg of the current
          process instead of target_memcg.
      
        memalloc_unuse_memcg();
      
      This patch extends the remote charging API by switching to a single
      function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
      which sets the new value and returns the old one.  A remote charging
      block then looks like:
      
        old_memcg = set_active_memcg(target_memcg);
        <...>
        set_active_memcg(old_memcg);
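
      A sketch of the helper under this scheme, assuming the per-task
      current->active_memcg field used by the existing API (later changes may
      add e.g. interrupt-context handling):

        static inline struct mem_cgroup *
        set_active_memcg(struct mem_cgroup *memcg)
        {
                struct mem_cgroup *old = current->active_memcg;

                current->active_memcg = memcg;
                return old;
        }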
      
      This patch is heavily based on a patch by Johannes Weiner, which can be
      found here: https://lkml.org/lkml/2020/5/28/806
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Schatzberg <dschatzberg@fb.com>
      Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 Sep, 2020 1 commit
    • fs: Don't invalidate page buffers in block_write_full_page() · 6dbf7bb5
      Jan Kara authored
      If block_write_full_page() is called for a page that is beyond current
      inode size, it will truncate page buffers for the page and return 0.
      This logic was added in 2.5.62 by commit 81eb6906 ("fix ext3
      BUG due to race with truncate") in the history.git tree to fix a problem
      with ext3 in data=ordered mode.  That particular problem doesn't exist
      anymore because ext3 is long gone and ext4 handles ordered data
      differently.  Also, buffers are normally invalidated by the truncate
      code, so there's no need to specially handle this in ->writepage() code.
      
      This invalidation of page buffers in block_write_full_page() causes
      problems for filesystems (e.g. ext4 or ocfs2) when the block device is
      shrunk under the filesystem's feet and metadata buffers get discarded
      while still being tracked by the journalling layer.  Although this is
      obviously "not supported", it can cause kernel crashes like:
      
      [ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [ 7986.697197] PGD 0 P4D 0
      [ 7986.699724] Oops: 0002 [#1] SMP PTI
      [ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
      [ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
      [ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
      ...
      [ 7986.810150] Call Trace:
      [ 7986.812595]  __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
      [ 7986.818408]  jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
      [ 7986.836467]  kjournald2+0xbd/0x270 [jbd2]
      
      which is not great.  The crash happens because bh->b_private is suddenly
      NULL although the BH_JBD flag is still set (this is because
      block_invalidatepage() cleared the BH_Mapped flag, and a subsequent bh
      lookup found the buffer without BH_Mapped set and called
      init_page_buffers(), which overwrote bh->b_private).  So just remove the
      invalidation in block_write_full_page().
      
      Note that the buffer cache invalidation performed when a block device
      changes size is already careful to avoid similar problems: it uses
      invalidate_mapping_pages(), which skips busy buffers.  So it was only
      this odd block_write_full_page() behavior that could tear down bdev
      buffers under the filesystem's feet.
      Reported-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      CC: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 07 Aug, 2020 1 commit
    • fs: prevent BUG_ON in submit_bh_wbc() · 377254b2
      Xianting Tian authored
      If a device is hot-removed --- for example, when a physical device is
      unplugged from a PCIe slot or an nbd device's network is shut down ---
      this can result in a BUG_ON() crash in submit_bh_wbc().  This is
      because when the block device dies, the buffer heads have their
      BH_Mapped flag cleared, leading to the crash in submit_bh_wbc().
      
      We had attempted to work around this problem in commit a17712c8
      ("ext4: check superblock mapped prior to committing").  Unfortunately,
      it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
      device dies between the work-around check in ext4_commit_super()
      and the moment submit_bh_wbc() is finally called:
      
      Code path:
      ext4_commit_super
          check if 'buffer_mapped(sbh)' is false, return   <== commit a17712c8
          lock_buffer(sbh)
          ...
          unlock_buffer(sbh)
          __sync_dirty_buffer(sbh,...)
              lock_buffer(sbh)
              check if 'buffer_mapped(sbh)' is false, return   <== added by this patch
              submit_bh(...,sbh)
                  submit_bh_wbc(...,sbh,...)
      
      [100722.966497] kernel BUG at fs/buffer.c:3095!   <== 'BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
      [100722.966503] invalid opcode: 0000 [#1] SMP
      [100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000
      [100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
      [100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246
      [100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000
      [100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001
      [100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d
      [100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800
      [100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000
      [100722.966580] FS:  00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000
      [100722.966580] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0
      [100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [100722.966583] PKRU: 55555554
      [100722.966583] Call Trace:
      [100722.966588]  __sync_dirty_buffer+0x6e/0xd0
      [100722.966614]  ext4_commit_super+0x1d8/0x290 [ext4]
      [100722.966626]  __ext4_std_error+0x78/0x100 [ext4]
      [100722.966635]  ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
      [100722.966646]  ext4_reserve_inode_write+0x58/0xb0 [ext4]
      [100722.966655]  ? ext4_dirty_inode+0x48/0x70 [ext4]
      [100722.966663]  ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
      [100722.966671]  ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
      [100722.966679]  ext4_dirty_inode+0x48/0x70 [ext4]
      [100722.966682]  __mark_inode_dirty+0x17f/0x350
      [100722.966686]  generic_update_time+0x87/0xd0
      [100722.966687]  touch_atime+0xa9/0xd0
      [100722.966690]  generic_file_read_iter+0xa09/0xcd0
      [100722.966694]  ? page_cache_tree_insert+0xb0/0xb0
      [100722.966704]  ext4_file_read_iter+0x4a/0x100 [ext4]
      [100722.966707]  ? __inode_security_revalidate+0x4f/0x60
      [100722.966709]  __vfs_read+0xec/0x160
      [100722.966711]  vfs_read+0x8c/0x130
      [100722.966712]  SyS_pread64+0x87/0xb0
      [100722.966716]  do_syscall_64+0x67/0x1b0
      [100722.966719]  entry_SYSCALL64_slow_path+0x25/0x25
      
      To address this, add a check of buffer_mapped(bh) to
      __sync_dirty_buffer().  This also has the benefit of fixing this for
      other file systems.

      With this addition, we can drop the workaround in ext4_commit_super().
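
      The added check, as a simplified sketch of the relevant hunk in
      __sync_dirty_buffer() (exact placement may differ):

        lock_buffer(bh);
        if (test_clear_buffer_dirty(bh)) {
                /*
                 * The bh should be mapped, but it might not be if the
                 * device was hot-removed.
                 */
                if (!buffer_mapped(bh)) {
                        unlock_buffer(bh);
                        return -EIO;
                }
                /* ... submit the buffer for write as before ... */
        }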
      
      [ Commit description rewritten by tytso. ]
      Signed-off-by: Xianting Tian <xianting_tian@126.com>
      Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126.com
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  5. 16 Apr, 2020 1 commit
    • ext4: use non-movable memory for superblock readahead · d87f6392
      Roman Gushchin authored
      Since commit a8ac900b ("ext4: use non-movable memory for the
      superblock"), buffers for the ext4 superblock have been allocated using
      the sb_bread_unmovable() helper, which allocates buffer heads
      out of non-movable memory blocks.  This is necessary to avoid blocking
      page migration and causing CMA allocation failures.
      
      However, commit 85c8f176 ("ext4: preload block group descriptors")
      broke this by introducing pre-reading of the ext4 superblock.
      The problem is that __breadahead() uses __getblk() underneath,
      which allocates buffer heads out of movable memory.
      
      It resulted in page migration failures I've seen on a machine
      with an ext4 partition and a preallocated cma area.
      
      Fix this by introducing sb_breadahead_unmovable() and
      __breadahead_gfp() helpers, which use non-movable memory for buffer
      head allocations, and by using them for the ext4 superblock readahead.
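
      A sketch of what these helpers can look like, assuming the existing
      __getblk_gfp() primitive with gfp == 0 meaning a non-movable allocation
      (as with the other *_unmovable variants); exact signatures may differ:

        void __breadahead_gfp(struct block_device *bdev, sector_t block,
                              unsigned size, gfp_t gfp)
        {
                struct buffer_head *bh = __getblk_gfp(bdev, block, size, gfp);

                if (likely(bh)) {
                        ll_rw_block(REQ_OP_READ, REQ_RAHEAD, 1, &bh);
                        brelse(bh);
                }
        }

        static inline void
        sb_breadahead_unmovable(struct super_block *sb, sector_t block)
        {
                __breadahead_gfp(sb->s_bdev, block, sb->s_blocksize, 0);
        }
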
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Fixes: 85c8f176 ("ext4: preload block group descriptors")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  6. 09 Jan, 2020 1 commit
    • fs: move guard_bio_eod() after bio_set_op_attrs · 83c9c547
      Ming Lei authored
      Commit 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      added bio_truncate() for handling bio EOD.  However, bio_truncate()
      doesn't use the 'op' parameter passed in by guard_bio_eod()'s callers;
      it retrieves the op from the bio itself, which hasn't been set yet at
      that point.

      So bio_truncate() may retrieve the wrong 'op', and the zeroing of pages
      may not be done for READ bios.

      Fix this issue by moving guard_bio_eod() after bio_set_op_attrs() in
      submit_bh_wbc() so that bio_truncate() can always retrieve the correct
      op info.

      Meanwhile, remove the 'op' parameter from guard_bio_eod() because it
      isn't used any more.
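
      The reordering in submit_bh_wbc() then looks roughly like:

        bio_set_op_attrs(bio, op, op_flags);

        /* Only truncate the bio once the op has been set */
        guard_bio_eod(bio);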
      
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Fixes: 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
      Fold in kerneldoc and bio_op() change.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 28 Dec, 2019 1 commit
    • block: add bio_truncate to fix guard_bio_eod · 85a8ce62
      Ming Lei authored
      Some filesystems, such as vfat, may send a bio which crosses the device
      boundary, and what's worse, an IO request starting within device
      boundaries can contain more than one segment past EOD.

      Commit dce30ca9 ("fs: fix guard_bio_eod to check for real EOD errors")
      tried to fix this issue by returning -EIO for this situation.  However,
      this way the fs user code loses the chance to handle -EIO, and then
      sync_inodes_sb() may hang forever.
      
      Also, the current truncation of the last segment is dangerous because it
      updates the last bvec: the bvec table is then no longer immutable, and
      fs bio users may not retrieve the truncated pages via
      bio_for_each_segment_all() in their .end_io callbacks.
      
      Fix this issue by supporting multi-segment truncation.  The approach is
      simpler (see the sketch after this list):

      - just update the bio size, since the block layer can build a correct
      bvec from the updated bio size; the bvec table then becomes really
      immutable.

      - zero all truncated segments for READ bios.
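
      A sketch of such a bio_truncate(), using the standard
      bio_for_each_segment() iterator (simplified; the final code may differ
      in details):

        void bio_truncate(struct bio *bio, unsigned new_size)
        {
                struct bio_vec bv;
                struct bvec_iter iter;
                unsigned int done = 0;
                bool truncated = false;

                if (new_size >= bio->bi_iter.bi_size)
                        return;

                if (bio_op(bio) == REQ_OP_READ) {
                        /* Zero everything past new_size in affected segments */
                        bio_for_each_segment(bv, bio, iter) {
                                if (done + bv.bv_len > new_size) {
                                        unsigned offset = truncated ? 0 : new_size - done;

                                        zero_user(bv.bv_page, bv.bv_offset + offset,
                                                  bv.bv_len - offset);
                                        truncated = true;
                                }
                                done += bv.bv_len;
                        }
                }

                /* Don't touch the bvec table: just shrink the bio size */
                bio->bi_iter.bi_size = new_size;
        }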
      
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Fixed-by: dce30ca9 ("fs: fix guard_bio_eod to check for real EOD errors")
      Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 14 Nov, 2019 1 commit
    • fs/buffer.c: support fscrypt in block_read_full_page() · 31fb992c
      Eric Biggers authored
      
      After each filesystem block (as represented by a buffer_head) has been
      read from disk by block_read_full_page(), decrypt it if needed.  The
      decryption is done on the fscrypt_read_workqueue.
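
      The shape of the change, as a simplified sketch; the fscrypt helpers
      fscrypt_decrypt_pagecache_blocks() and fscrypt_enqueue_decrypt_work()
      are the fs/crypto/ functions referred to below, and details may differ
      from the final code:

        struct decrypt_bh_ctx {
                struct work_struct work;
                struct buffer_head *bh;
        };

        static void decrypt_bh(struct work_struct *work)
        {
                struct decrypt_bh_ctx *ctx =
                        container_of(work, struct decrypt_bh_ctx, work);
                struct buffer_head *bh = ctx->bh;
                int err;

                /* Decrypt one filesystem block in place in the page cache */
                err = fscrypt_decrypt_pagecache_blocks(bh->b_page, bh->b_size,
                                                       bh_offset(bh));
                end_buffer_async_read(bh, err == 0);
                kfree(ctx);
        }

        /* I/O completion: punt decryption to the fscrypt workqueue if needed */
        static void end_buffer_async_read_io(struct buffer_head *bh, int uptodate)
        {
                if (uptodate && IS_ENABLED(CONFIG_FS_ENCRYPTION) &&
                    IS_ENCRYPTED(bh->b_page->mapping->host)) {
                        struct decrypt_bh_ctx *ctx = kmalloc(sizeof(*ctx), GFP_ATOMIC);

                        if (ctx) {
                                INIT_WORK(&ctx->work, decrypt_bh);
                                ctx->bh = bh;
                                fscrypt_enqueue_decrypt_work(&ctx->work);
                                return;
                        }
                        uptodate = 0;
                }
                end_buffer_async_read(bh, uptodate);
        }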
      
      This is the final change needed to support ext4 encryption with
      blocksize != PAGE_SIZE, and it's a fairly small change now that
      CONFIG_FS_ENCRYPTION is a bool and fs/crypto/ exposes functions to
      decrypt individual blocks and to enqueue work on the fscrypt workqueue.
      
      Don't try to add fs-verity support yet, as the fs/verity/ support layer
      isn't ready for sub-page blocks; just add fscrypt support for now.
      
      Almost all the new code is compiled away when CONFIG_FS_ENCRYPTION=n.
      
      Cc: Chandan Rajendra <chandan@linux.ibm.com>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20191023033312.361355-2-ebiggers@kernel.org
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  9. 28 Feb, 2019 1 commit
    • fs: fix guard_bio_eod to check for real EOD errors · dce30ca9
      Carlos Maiolino authored
      
      guard_bio_eod() can truncate a segment in a bio to allow it to do IO on
      the odd last sectors of a device.

      It already checks whether the IO starts past EOD, but it does not
      consider the possibility that an IO request starting within device
      boundaries can contain more than one segment past EOD.

      In such cases, truncated_bytes can be bigger than PAGE_SIZE and will
      underflow bvec->bv_len.

      Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
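
      Based on the description here (and on the -EIO behavior noted in the
      later bio_truncate commit above), the guard amounts to roughly the
      following; the exact bound in the patch may be PAGE_SIZE or the last
      bvec's length:

        /* Uhhuh. We've got a bio that straddles the device size! */
        truncated_bytes = bio->bi_iter.bi_size - (maxsector << 9);

        /*
         * More than one segment lies past EOD: truncating just the last
         * bvec would underflow bvec->bv_len, so give up and let the IO
         * layer turn this into an EIO.
         */
        if (truncated_bytes > PAGE_SIZE)
                return -EIO;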
      
      This situation has been found on filesystems such as isofs and vfat,
      which don't check the device size before mount.  If the device is
      smaller than the filesystem itself, a readahead on such a filesystem
      which spans EOD can trigger this situation, leading to a call to
      zero_user() with a wrong size, possibly corrupting memory.

      I didn't see any crash, and didn't let the system run long enough to
      check if memory corruption would be hit somewhere, but adding
      instrumentation to guard_bio_eod() to check the truncated_bytes size
      was enough to see the error.
      
      The following script can trigger the error.
      
      MNT=/mnt
      IMG=./DISK.img
      DEV=/dev/loop0
      
      mkfs.vfat $IMG
      mount $IMG $MNT
      cp -R /etc $MNT &> /dev/null
      umount $MNT
      
      losetup -D
      
      losetup --find --show --sizelimit 16247280 $IMG
      mount $DEV $MNT
      
      find $MNT -type f -exec cat {} + >/dev/null
      
      Kudos to Eric Sandeen for coming up with the reproducer above.
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 06 Feb, 2019 1 commit
    • fs: ratelimit __find_get_block_slow() failure message. · 43636c80
      Tetsuo Handa authored
      
      When something lets __find_get_block_slow() hit the all_mapped path, it
      calls printk() 100+ times per second.  But there is no need to print the
      same message with such high frequency; it is just asking for a stall
      warning, or at least bloating the log files.
      
        [  399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
        [  399.873324][T15342] b_state=0x00000029, b_size=512
        [  399.878403][T15342] device loop0 blocksize: 4096
        [  399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
        [  399.890400][T15342] b_state=0x00000029, b_size=512
        [  399.895595][T15342] device loop0 blocksize: 4096
        [  399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
        [  399.907471][T15342] b_state=0x00000029, b_size=512
        [  399.912506][T15342] device loop0 blocksize: 4096
      
      This patch reduces the frequency to at most once per second, in
      addition to concatenating the three lines into one.
      
        [  399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096
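
      The ratelimiting can be done with a static ratelimit state in
      __find_get_block_slow(), roughly as follows (variable names as in the
      surrounding function):

        static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);

        /* Don't emit "callbacks suppressed" lines while ratelimiting */
        ratelimit_set_flags(&last_warned, RATELIMIT_MSG_ON_RELEASE);
        if (all_mapped && __ratelimit(&last_warned)) {
                printk("__find_get_block_slow() failed. block=%llu, "
                       "b_blocknr=%llu, b_state=0x%08lx, b_size=%zu, "
                       "device %pg blocksize: %d\n",
                       (unsigned long long)block,
                       (unsigned long long)bh->b_blocknr,
                       bh->b_state, bh->b_size, bdev,
                       1 << bd_inode->i_blkbits);
        }
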
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>