1. 23 Jun, 2021 2 commits
  2. 14 May, 2021 1 commit
      f2fs: restructure f2fs page.private layout · b763f3be
      Chao Yu authored
      Restructure f2fs page private layout for the reasons below:
      
      There are several cases where f2fs wants to set a flag in a page to
      indicate a specific status of the page:
      a) page is in transaction list for atomic write
      b) page contains dummy data for aligned write
      c) page is migrating for GC
      d) page contains inline data for inline inode flush
      e) page belongs to merkle tree, and is verified for fsverity
      f) page is dirty and has filesystem/inode reference count for writeback
      g) page is temporary and has decompress io context reference for compression
      
      There are existing fields in the page structure we can use to store
      f2fs private status/data:
      - page.flags: PG_checked, PG_private
      - page.private
      
      However, our usage of them was inconsistent, which may cause potential
      conflicts:
      		page.private	PG_private	PG_checked	page._refcount (+1 at most)
      a)		-1		set				+1
      b)		-2		set
      c), d), e)					set
      f)		0		set				+1
      g)		pointer		set
      
      The other problem is that page.flags has no free slot; if we can avoid
      setting page.private to zero when setting the PG_private flag, and
      instead use a non-zero value to indicate PG_private status, we may have
      a chance to reclaim the PG_private slot for other usage. [1]
      
      Another concern is that f2fs scales poorly when it comes to indicating
      more page statuses.
      
      So in this patch, let's restructure f2fs' page.private as below to
      solve the above issues:
      
      Layout A: lowest bit should be 1
      | bit0 = 1 | bit1 | bit2 | ... | bit MAX | private data .... |
       bit 0	PAGE_PRIVATE_NOT_POINTER
       bit 1	PAGE_PRIVATE_ATOMIC_WRITE
       bit 2	PAGE_PRIVATE_DUMMY_WRITE
       bit 3	PAGE_PRIVATE_ONGOING_MIGRATION
       bit 4	PAGE_PRIVATE_INLINE_INODE
       bit 5	PAGE_PRIVATE_REF_RESOURCE
       bit 6-	f2fs private data
      
      Layout B: lowest bit should be 0
       page.private is a wrapped pointer.
      
      After the change:
      		page.private	PG_private	PG_checked	page._refcount (+1 at most)
      a)		11		set				+1
      b)		101		set				+1
      c)		1001		set				+1
      d)		10001		set				+1
      e)						set
      f)		100001		set				+1
      g)		pointer		set				+1
      
      [1] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org/T/#u
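      As a rough illustration of the two layouts, here is a minimal user-space
      sketch; the enum names mirror the bits listed above, but the helper
      functions are ours, not the kernel's actual macros:

```c
#include <stdint.h>

/* Hedged sketch of Layout A/B above; helpers are illustrative only. */
enum {
	PAGE_PRIVATE_NOT_POINTER,       /* bit 0: value is flags, not a pointer */
	PAGE_PRIVATE_ATOMIC_WRITE,      /* bit 1 */
	PAGE_PRIVATE_DUMMY_WRITE,       /* bit 2 */
	PAGE_PRIVATE_ONGOING_MIGRATION, /* bit 3 */
	PAGE_PRIVATE_INLINE_INODE,      /* bit 4 */
	PAGE_PRIVATE_REF_RESOURCE,      /* bit 5 */
	PAGE_PRIVATE_MAX,               /* bits 6+ carry f2fs private data */
};

/* set a status flag; bit 0 is always set too, so the value is non-zero
 * and distinguishable from a wrapped (word-aligned) pointer */
static unsigned long set_page_private_flag(unsigned long priv, int flag)
{
	return priv | (1UL << PAGE_PRIVATE_NOT_POINTER) | (1UL << flag);
}

/* Layout B: a word-aligned pointer always has bit 0 clear */
static int page_private_is_pointer(unsigned long priv)
{
	return !(priv & (1UL << PAGE_PRIVATE_NOT_POINTER));
}
```

      With this encoding, any flagged value is odd (bit 0 set), matching the
      rows in the table above, while a pointer stored in page.private always
      has bit 0 clear.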
      
      
      
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  3. 11 May, 2021 1 commit
      f2fs: avoid null pointer access when handling IPU error · 349c4d6c
      Jaegeuk Kim authored
       Unable to handle kernel NULL pointer dereference at virtual address 000000000000001a
       pc : f2fs_inplace_write_data+0x144/0x208
       lr : f2fs_inplace_write_data+0x134/0x208
       Call trace:
        f2fs_inplace_write_data+0x144/0x208
        f2fs_do_write_data_page+0x270/0x770
        f2fs_write_single_data_page+0x47c/0x830
        __f2fs_write_data_pages+0x444/0x98c
        f2fs_write_data_pages.llvm.16514453770497736882+0x2c/0x38
        do_writepages+0x58/0x118
        __writeback_single_inode+0x44/0x300
        writeback_sb_inodes+0x4b8/0x9c8
        wb_writeback+0x148/0x42c
        wb_do_writeback+0xc8/0x390
        wb_workfn+0xb0/0x2f4
        process_one_work+0x1fc/0x444
        worker_thread+0x268/0x4b4
        kthread+0x13c/0x158
        ret_from_fork+0x10/0x18
      
      Fixes: 95577278 ("f2fs: drop inplace IO if fs status is abnormal")
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  4. 26 Apr, 2021 1 commit
  5. 22 Apr, 2021 1 commit
  6. 13 Apr, 2021 2 commits
      f2fs: fix to avoid NULL pointer dereference · 594b6d04
      Yi Chen authored
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      pc : f2fs_put_page+0x1c/0x26c
      lr : __revoke_inmem_pages+0x544/0x75c
      f2fs_put_page+0x1c/0x26c
      __revoke_inmem_pages+0x544/0x75c
      __f2fs_commit_inmem_pages+0x364/0x3c0
      f2fs_commit_inmem_pages+0xc8/0x1a0
      f2fs_ioc_commit_atomic_write+0xa4/0x15c
      f2fs_ioctl+0x5b0/0x1574
      file_ioctl+0x154/0x320
      do_vfs_ioctl+0x164/0x740
      __arm64_sys_ioctl+0x78/0xa4
      el0_svc_common+0xbc/0x1d0
      el0_svc_handler+0x74/0x98
      el0_svc+0x8/0xc
      
      In f2fs_put_page, we access page->mapping, which is NULL.
      The root cause is: in some cases, the page refcount and the
      ATOMIC_WRITTEN_PAGE flag fail to be set even though the page-private
      flag has been set.
      We add f2fs_bug_on like this:
      
      f2fs_register_inmem_page()
      {
      	...
      	f2fs_set_page_private(page, ATOMIC_WRITTEN_PAGE);
      
      	f2fs_bug_on(F2FS_I_SB(inode), !IS_ATOMIC_WRITTEN_PAGE(page));
      	...
      }
      
      The f2fs_bug_on() call stack looks like this:
      PC is at f2fs_register_inmem_page+0x238/0x2b4
      LR is at f2fs_register_inmem_page+0x2a8/0x2b4
      f2fs_register_inmem_page+0x238/0x2b4
      f2fs_set_data_page_dirty+0x104/0x164
      set_page_dirty+0x78/0xc8
      f2fs_write_end+0x1b4/0x444
      generic_perform_write+0x144/0x1cc
      __generic_file_write_iter+0xc4/0x174
      f2fs_file_write_iter+0x2c0/0x350
      __vfs_write+0x104/0x134
      vfs_write+0xe8/0x19c
      SyS_pwrite64+0x78/0xb8
      
      To fix this issue, let's add a page refcount when setting the
      page-private flag. Why the page-private flag was not cleared still
      needs further analysis.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Ge Qiu <qiuge@huawei.com>
      Signed-off-by: Dehe Gu <gudehe@huawei.com>
      Signed-off-by: Yi Chen <chenyi77@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: avoid duplicated codes for cleanup · 453e2ff8
      Chao Yu authored
      
      f2fs_segment_has_free_slot() was copied and modified from
      __next_free_blkoff(); they are almost the same, so clean up to
      reuse common code as much as possible.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  7. 10 Apr, 2021 1 commit
  8. 06 Apr, 2021 2 commits
  9. 31 Mar, 2021 2 commits
      f2fs: Fix a hungtask problem in atomic write · be1ee45d
      Yi Zhuang authored
      In the cache writing process, if it is an atomic file, increase the page
      count of F2FS_WB_CP_DATA, otherwise increase the page count of
      F2FS_WB_DATA.
      
      When the hook branch is taken due to insufficient memory in
      f2fs_write_begin, f2fs_drop_inmem_pages_all is called to traverse
      all atomic inodes and clear the FI_ATOMIC_FILE mark of all atomic files.
      
      In f2fs_drop_inmem_pages, we first acquire the inmem_lock, revoke all
      the inmem_pages, and then clear the FI_ATOMIC_FILE mark. Before this
      mark is cleared, other threads may hold inmem_lock to add inmem_pages
      to the inode whose inmem_pages were just emptied, and increase the
      page count of F2FS_WB_CP_DATA.
      
      When the IO returns, it finds that the FI_ATOMIC_FILE flag was cleared
      by f2fs_drop_inmem_pages_all, so f2fs_is_atomic_file returns false, which
      causes the page count of F2FS_WB_DATA to be decremented instead. The page
      count of F2FS_WB_CP_DATA can never reach zero, and finally hungtask is
      triggered in f2fs_wait_on_all_pages because get_pages never returns zero.
      
      process A:				process B:
      f2fs_drop_inmem_pages_all
      ->f2fs_drop_inmem_pages of inode#1
          ->mutex_lock(&fi->inmem_lock)
          ->__revoke_inmem_pages of inode#1	f2fs_ioc_commit_atomic_write
          ->mutex_unlock(&fi->inmem_lock)	->f2fs_commit_inmem_pages of inode#1
      					->mutex_lock(&fi->inmem_lock)
      					->__f2fs_commit_inmem_pages
      					    ->f2fs_do_write_data_page
      					        ->f2fs_outplace_write_data
      					            ->do_write_page
      					                ->f2fs_submit_page_write
      					                    ->inc_page_count(sbi, F2FS_WB_CP_DATA )
      					->mutex_unlock(&fi->inmem_lock)
          ->spin_lock(&sbi->inode_lock[ATOMIC_FILE]);
          ->clear_inode_flag(inode, FI_ATOMIC_FILE)
          ->spin_unlock(&sbi->inode_lock[ATOMIC_FILE])
      					f2fs_write_end_io
      					->dec_page_count(sbi, F2FS_WB_DATA );
      
      We can fix the problem by moving the clearing of the FI_ATOMIC_FILE
      mark under the inmem_lock. This ensures that no one can submit inmem
      pages before the FI_ATOMIC_FILE mark is cleared, so there will be no
      atomic writes waiting for writeback.
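      The invariant behind the fix can be sketched in a single-threaded,
      user-space form; all names here are illustrative models, not f2fs's
      actual structures or locking:

```c
#include <stdbool.h>

/* Illustrative model of the fix: revoking inmem pages and clearing
 * FI_ATOMIC_FILE happen in one critical section (under fi->inmem_lock
 * in the real patch), so a racing commit can no longer see the pages
 * revoked while the flag is still set. */
struct inode_state {
	bool fi_atomic_file;
	int inmem_pages;
};

static void drop_inmem_pages(struct inode_state *fi)
{
	/* both steps inside the same inmem_lock region in the actual fix */
	fi->inmem_pages = 0;
	fi->fi_atomic_file = false;
}

static int commit_inmem_pages(const struct inode_state *fi)
{
	/* a commit after the drop finds nothing left to submit */
	return fi->fi_atomic_file ? fi->inmem_pages : 0;
}

static int drop_then_commit(void)
{
	struct inode_state fi = { .fi_atomic_file = true, .inmem_pages = 4 };

	drop_inmem_pages(&fi);
	return commit_inmem_pages(&fi);
}
```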
      
      Fixes: 57864ae5 ("f2fs: limit # of inmemory pages")
      Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: introduce gc_merge mount option · 5911d2d1
      Chao Yu authored
      
      In this patch, we add two new mount options: "gc_merge" and
      "nogc_merge". When background_gc is on, the "gc_merge" option can be
      set to let the background GC thread handle foreground GC requests;
      this can eliminate the sluggishness caused by slow foreground GC
      operations when GC is triggered from a process with limited I/O
      and CPU resources.
      
      Original idea is from Xiang.
      Signed-off-by: Gao Xiang <xiang@kernel.org>
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  10. 26 Mar, 2021 5 commits
  11. 23 Mar, 2021 1 commit
      f2fs: fix to align to section for fallocate() on pinned file · e1175f02
      Chao Yu authored
      
      Now, fallocate() on a pinned file only allocates blocks aligned to a
      segment rather than a section, so GC may try to migrate the pinned
      file's blocks, and after several failures, the pinned file's blocks
      could be migrated elsewhere. However, the user won't be aware of this
      condition, and the old obsolete block address may then be read/written
      incorrectly.
      
      To avoid such condition, let's try to allocate pinned file's blocks
      with section alignment.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  12. 12 Mar, 2021 3 commits
  13. 11 Mar, 2021 1 commit
  14. 08 Feb, 2021 1 commit
      f2fs: don't grab superblock freeze for flush/ckpt thread · d50dfc0c
      Jaegeuk Kim authored
      
      These are controlled by f2fs_freeze().
      
      This fixes xfstests/generic/068 which is stuck at
      
       task:f2fs_ckpt-252:3 state:D stack:    0 pid: 5761 ppid:     2 flags:0x00004000
       Call Trace:
        __schedule+0x44c/0x8a0
        schedule+0x4f/0xc0
        percpu_rwsem_wait+0xd8/0x140
        ? percpu_down_write+0xf0/0xf0
        __percpu_down_read+0x56/0x70
        issue_checkpoint_thread+0x12c/0x160 [f2fs]
        ? wait_woken+0x80/0x80
        kthread+0x114/0x150
        ? __checkpoint_and_complete_reqs+0x110/0x110 [f2fs]
        ? kthread_park+0x90/0x90
        ret_from_fork+0x22/0x30
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  15. 27 Jan, 2021 2 commits
  16. 03 Dec, 2020 3 commits
      f2fs: add compress_mode mount option · 602a16d5
      Daeho Jeong authored
      
      We add a new "compress_mode" mount option to control the file
      compression mode. This supports "fs" and "user". In "fs" mode (default),
      f2fs does automatic compression on compression-enabled files.
      In "user" mode, f2fs disables automatic compression and gives the
      user discretion in choosing the target file and the timing, meaning
      the user can do manual compression/decompression on compression-enabled
      files using ioctls.
      Signed-off-by: Daeho Jeong <daehojeong@google.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: init dirty_secmap incorrectly · 5335bfc6
      Jack Qiu authored
      
      The section is dirty, but dirty_secmap may not be set.
      Reported-by: Jia Yang <jiayang5@huawei.com>
      Fixes: da52f8ad ("f2fs: get the right gc victim section when section has several segments")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: fix to avoid REQ_TIME and CP_TIME collision · 493720a4
      Chao Yu authored
      
      Lei Li reported an issue: if foreground operations are frequent, the
      background checkpoint may always be skipped due to the check below,
      resulting in losing more data after a sudden power-cut.
      
      f2fs_balance_fs_bg()
      ...
      	if (!is_idle(sbi, REQ_TIME) &&
      		(!excess_dirty_nats(sbi) && !excess_dirty_nodes(sbi)))
      		return;
      
      E.g:
      cp_interval = 5 second
      idle_interval = 2 second
      foreground operation interval = 1 second (append 1 byte per second into file)
      
      In such a case, no matter when f2fs_balance_fs_bg() is called,
      is_idle(, REQ_TIME) returns false, resulting in the background
      checkpoint being skipped.
      
      This patch changes the logic as below to make the trigger condition more
      reasonable:
      - trigger sync_fs() if dirty_{nats,nodes} and prefree segs exceed thresholds;
      - skip triggering sync_fs() if there is any background inflight IO, or if
      there was a recent foreground operation while cp_rwsem is held by someone;
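      The revised trigger condition can be sketched as a plain predicate; the
      boolean parameters stand in for the kernel's checks (excess dirty
      nats/nodes, prefree segments, inflight IO, recent foreground ops while
      cp_rwsem is held), so this is an illustration, not the patch's exact code:

```c
#include <stdbool.h>

/* Hedged sketch of the trigger logic described above. */
static bool should_trigger_sync_fs(bool excess_dirty_or_prefree,
				   bool bg_io_inflight,
				   bool fg_op_recent_and_cp_rwsem_held)
{
	if (!excess_dirty_or_prefree)
		return false;   /* nothing has piled up worth a checkpoint */
	if (bg_io_inflight)
		return false;   /* don't compete with background inflight IO */
	if (fg_op_recent_and_cp_rwsem_held)
		return false;   /* foreground work is actively in progress */
	return true;
}
```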
      Reported-by: Lei Li <noctis.akm@gmail.com>
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  17. 14 Oct, 2020 2 commits
      f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case · 6ed29fe1
      Chao Yu authored
      
      This patch changes f2fs_flush_device_cache() to skip issuing flush for
      nobarrier case.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: handle errors of f2fs_get_meta_page_nofail · 86f33603
      Jaegeuk Kim authored
      
      First problem is we hit BUG_ON() in f2fs_get_sum_page given EIO on
      f2fs_get_meta_page_nofail().
      
      The quick fix was to avoid returning any error by retrying in an
      infinite loop, but syzbot caught a case where it entered that loop from
      a fuzzed image. It turned out we abused f2fs_get_meta_page_nofail()
      as in the below call stack.
      
      - f2fs_fill_super
       - f2fs_build_segment_manager
        - build_sit_entries
         - get_current_sit_page
      
      INFO: task syz-executor178:6870 can't die for more than 143 seconds.
      task:syz-executor178 state:R
       stack:26960 pid: 6870 ppid:  6869 flags:0x00004006
      Call Trace:
      
      Showing all locks held in the system:
      1 lock held by khungtaskd/1179:
       #0: ffffffff8a554da0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6242
      1 lock held by systemd-journal/3920:
      1 lock held by in:imklog/6769:
       #0: ffff88809eebc130 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:930
      1 lock held by syz-executor178/6870:
       #0: ffff8880925120e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0x201/0xaf0 fs/super.c:229
      
      Actually, we didn't have to use _nofail in this case, since we could
      already return an error to mount(2) via the error handler.
      
      As a result, this patch tries to 1) remove _nofail callers as much as
      possible, and 2) deal with the error case in the last remaining caller,
      f2fs_get_sum_page().
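      The direction of the change can be sketched as a plain error-returning
      getter; the names here are hypothetical stand-ins, not f2fs's real API:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical getter: the caller can unwind and fail mount(2) instead
 * of retrying forever in a _nofail variant. */
static int get_meta_page(int simulate_eio, const void **page)
{
	if (simulate_eio) {
		*page = NULL;
		return -EIO;    /* propagate the IO error to the caller */
	}
	*page = "meta";         /* stand-in for a valid page */
	return 0;
}

/* mount-path shape: bail out as soon as a getter fails */
static int demo_fail_path(void)
{
	const void *page;

	return get_meta_page(1, &page);
}
```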
      
      Reported-by: syzbot+ee250ac8137be41d7b13@syzkaller.appspotmail.com
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  18. 14 Sep, 2020 1 commit
  19. 11 Sep, 2020 1 commit
      f2fs: support age threshold based garbage collection · 093749e2
      Chao Yu authored
      
      There are several issues in the current background GC algorithm:
      - valid block count is one of the key factors in cost overhead
      calculation, so if a segment has few valid blocks, the CB algorithm
      will still choose it as a victim even if its age is young or it sits
      in a hot area, which is not appropriate.
      - GCed data/node blocks go to existing logs whether or not the data's
      update frequency is the same, so it may mix hot and cold data again.
      - the GC allocator mainly uses LFS-type segments, which consumes free
      segments more quickly.
      
      This patch introduces a new algorithm named age-threshold-based
      garbage collection to solve the above issues, in three main steps:
      
      1. select a source victim:
      - set an age threshold, and select candidates based on the threshold:
      e.g.
       0 means youngest, 100 means oldest; if we set the age threshold to 80,
       then select dirty segments whose age is in the range [80, 100] as
       candidates;
      - set a candidate_ratio threshold, and select candidates based on the
      ratio, so that we can shrink the candidates to the oldest segments;
      - select the target segment with the fewest valid blocks in order to
      migrate blocks with minimum cost;
      
      2. select a target victim:
      - select candidates based on the age threshold;
      - set a candidate_radius threshold, and search candidates whose age is
      around the source victim's; the search radius should be less than the
      radius threshold.
      - select the target segment with the most valid blocks in order to avoid
      migrating the current target segment.
      
      3. merge valid blocks from the source victim into the target victim with
      the SSR allocator.
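      Step 1 above can be sketched in a few lines; the types and names are
      illustrative, and the candidate_ratio shrinking is omitted for brevity:

```c
/* Minimal sketch of step 1: among dirty segments whose normalized age
 * (0 = youngest, 100 = oldest) is at or above the threshold, pick the
 * candidate with the fewest valid blocks. */
struct dirty_seg {
	int age;          /* normalized age, 0..100 */
	int valid_blocks; /* migration cost is proportional to this */
};

static int pick_source_victim(const struct dirty_seg *segs, int n,
			      int age_threshold)
{
	int victim = -1;

	for (int i = 0; i < n; i++) {
		if (segs[i].age < age_threshold)
			continue; /* too young to be a candidate */
		if (victim < 0 ||
		    segs[i].valid_blocks < segs[victim].valid_blocks)
			victim = i;
	}
	return victim; /* index of the victim, or -1 if none qualifies */
}

static int demo_pick(void)
{
	const struct dirty_seg segs[] = {
		{ 90, 300 }, { 85, 120 }, { 40, 10 }, { 95, 200 },
	};

	/* segment 2 is too young; segment 1 has the fewest valid blocks */
	return pick_source_victim(segs, 4, 80);
}
```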
      
      Test steps:
      - create 160 dirty segments:
       * half of them have 128 valid blocks per segment
       * the other half have 384 valid blocks per segment
      - run background GC
      
      Benefit: GC count and block movement count both decrease noticeably:
      
      - Before:
        - Valid: 86
        - Dirty: 1
        - Prefree: 11
        - Free: 6001 (6001)
      
      GC calls: 162 (BG: 220)
        - data segments : 160 (160)
        - node segments : 2 (2)
      Try to move 41454 blocks (BG: 41454)
        - data blocks : 40960 (40960)
        - node blocks : 494 (494)
      
      IPU: 0 blocks
      SSR: 0 blocks in 0 segments
      LFS: 41364 blocks in 81 segments
      
      - After:
      
        - Valid: 87
        - Dirty: 0
        - Prefree: 4
        - Free: 6008 (6008)
      
      GC calls: 75 (BG: 76)
        - data segments : 74 (74)
        - node segments : 1 (1)
      Try to move 12813 blocks (BG: 12813)
        - data blocks : 12544 (12544)
        - node blocks : 269 (269)
      
      IPU: 0 blocks
      SSR: 12032 blocks in 77 segments
      LFS: 855 blocks in 2 segments
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      [Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  20. 10 Sep, 2020 6 commits
      f2fs: support 64-bits key in f2fs rb-tree node entry · 2e9b2bb2
      Chao Yu authored
      
      Then we can add a specified entry into the rb-tree with a 64-bit
      segment time as the key.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: inherit mtime of original block during GC · c5d02785
      Chao Yu authored
      
      Don't let f2fs' internal GC ruin the original aging degree of a segment.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: record average update time of segment · 6f3a01ae
      Chao Yu authored
      
      Previously, once we updated one block in a segment, we updated the
      segment's mtime to the latest time, making an aged segment look
      freshest and causing GC with the cost-benefit algorithm to miss such
      segments. So this patch changes mtime to record the average block
      update time instead of the last update time.
      
      There is no need to reset mtime for a prefree segment: since
      se->valid_blocks is zero, the old se->mtime carries no weight in the
      calculation below:
      
      	se->mtime = div_u64(se->mtime * se->valid_blocks + mtime,
      					se->valid_blocks + 1);
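      The update rule above, written as a plain function for clarity (the
      function name is ours; only the arithmetic comes from the message):

```c
#include <stdint.h>

/* Fold the current time into a running average weighted by the number
 * of valid blocks. With valid_blocks == 0 (e.g. a prefree segment) the
 * old mtime has no weight, which is why no explicit reset is needed. */
static uint64_t update_segment_mtime(uint64_t se_mtime,
				     uint64_t valid_blocks,
				     uint64_t mtime)
{
	return (se_mtime * valid_blocks + mtime) / (valid_blocks + 1);
}
```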
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: introduce inmem curseg · d0b9e42a
      Chao Yu authored
      
      The previous implementation of aligned pinfile allocation will:
      - allocate a new segment on the cold data log no matter whether the
      last used segment is partially used or not, which makes IOs more random;
      - force concurrent cold data/GCed IO into the warm data area, which
      can hurt hot/cold data separation;
      
      In this patch, we introduce a new type of log named 'inmem curseg';
      the differences from a normal curseg are:
      - it reuses existing segment types (CURSEG_XXX_NODE/DATA);
      - it only exists in memory; its segno, blkofs, and summary will not be
       persisted into the checkpoint area;
      
      With this new feature, we can enhance the scalability of logs, and
      special allocators can be created for specific purposes:
      - a pure lfs allocator for aligned pinfile allocation or file
      defragmentation
      - a pure ssr allocator for a later feature
      
      So let's update aligned pinfile allocation to use this new
      inmem curseg framework.
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: remove duplicated type casting · e90027d2
      Xiaojun Wang authored
      
      Since DUMMY_WRITTEN_PAGE and ATOMIC_WRITTEN_PAGE have already been
      converted to unsigned long type, we don't need to do the type casting again.
      Signed-off-by: Xiaojun Wang <wangxiaojun11@huawei.com>
      Reported-by: Jack Qiu <jack.qiu@huawei.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      f2fs: support zone capacity less than zone size · de881df9
      Aravind Ramesh authored
      
      NVMe Zoned Namespace devices can have a zone-capacity less than the
      zone-size. Zone-capacity indicates the maximum number of sectors that
      are usable in a zone, beginning from the first sector of the zone. This
      makes the sectors after the zone-capacity up to the zone-size unusable.
      This patch tracks zone-size and zone-capacity in zoned devices and
      calculates the usable blocks per segment and usable segments per section.
      
      If zone-capacity is less than zone-size, mark only those segments which
      start before zone-capacity as free segments. All segments at and beyond
      zone-capacity are treated as permanently used segments. In cases where
      zone-capacity does not align with the segment size, the last segment will
      start before zone-capacity and end beyond it. For such spanning segments,
      only sectors within the zone-capacity are used.
      
      During writes and GC, manage the usable segments in a section and the
      usable blocks per segment. Segments which are beyond zone-capacity are
      never allocated and do not need to be garbage collected; only the
      segments before zone-capacity need to be garbage collected. For spanning
      segments, based on the number of usable blocks in that segment, write to
      blocks only up to the zone-capacity.
      
      Zone-capacity is device specific and cannot be configured by the user.
      Since NVMe ZNS device zones are sequential-write-only, a block device
      with conventional zones or any normal block device is needed alongside
      the ZNS device for the metadata operations of f2fs.
      
      A typical nvme-cli output of a zoned device shows zone start and capacity
      and write pointer as below:
      
      SLBA: 0x0     WP: 0x0     Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
      SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
      SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
      
      Here the zone size is 64MB and the capacity is 49MB; WP is at the zone
      start as the zones are in the EMPTY state. For each zone, only zone
      start + 49MB is usable; any lba/sector after 49MB cannot be read or
      written, and the drive will fail any attempt to do so. So the second
      zone starts at 64MB and is usable till 113MB (64 + 49), and the range
      between 113 and 128MB is again unusable. The next zone starts at 128MB,
      and so on.
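      The usable-length arithmetic for full, spanning, and beyond-capacity
      segments can be sketched as below; units are arbitrary but consistent
      (e.g. MB, matching the 64MB/49MB example above), and the names are ours:

```c
/* How much of a segment lies inside the zone capacity. */
static long usable_len_in_segment(long seg_start, long seg_len,
				  long zone_start, long zone_capacity)
{
	long cap_end = zone_start + zone_capacity;

	if (seg_start >= cap_end)
		return 0;              /* entirely beyond capacity: unusable */
	if (seg_start + seg_len <= cap_end)
		return seg_len;        /* entirely within capacity: all usable */
	return cap_end - seg_start;    /* spanning segment: partially usable */
}
```

      With 2MB segments in the example's first zone, segments starting at 0..46MB
      are fully usable, the segment starting at 48MB spans the 49MB capacity, and
      segments starting at 50MB or later are unusable.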
      Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  21. 09 Sep, 2020 1 commit