- 23 Jun, 2021 2 commits
-
-
Jaegeuk Kim authored
Given the RO feature in the superblock, we don't need to check provisioning/reserved spaces and the SSA area.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Joe Perches authored
Update the logging uses that have unnecessary trailing newlines, as the f2fs_printk function, and so its f2fs_<level> macro callers, already adds one. This allows searching for single-line logging entries with an easier grep and also avoids unnecessary blank lines in the logging. Miscellanea:
o Coalesce formats
o Align to open parenthesis
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
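A hedged before/after sketch of the class of change described (f2fs_err() is a real f2fs_<level> macro; the message text is an illustrative assumption):

	/* before: the trailing "\n" produces a blank line, because
	 * f2fs_printk() already appends a newline */
	f2fs_err(sbi, "invalid crc value\n");

	/* after */
	f2fs_err(sbi, "invalid crc value");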
-
- 14 May, 2021 1 commit
-
-
Chao Yu authored
Restructure f2fs' page private layout, for the reasons below.

There are some cases where f2fs wants to set a flag in a page to indicate a specified status of the page:
a) page is in the transaction list for atomic write
b) page contains dummy data for aligned write
c) page is migrating for GC
d) page contains inline data for inline inode flush
e) page belongs to a merkle tree, and is verified for fsverity
f) page is dirty and has a filesystem/inode reference count for writeback
g) page is temporary and has a decompress io context reference for compression

There are existing places in the page structure we can use to store f2fs private status/data:
- page.flags: PG_checked, PG_private
- page.private

However, the way we used them was a mess, which may cause potential conflicts:

page.private	PG_private	PG_checked	page._refcount (+1 at most)
a)	-1	set				+1
b)	-2	set
c), d), e)			set
f)	0	set				+1
g)	pointer	set

The other problem is that page.flags has no free slot; if we can avoid setting page.private to zero while setting the PG_private flag, and instead use a non-zero value to indicate PG_private status, then we may have a chance to reclaim the PG_private slot for other usage. [1]

The other concern is that f2fs has bad scalability in the aspect of indicating more page statuses.

So in this patch, let's restructure f2fs' page.private as below to solve the above issues:

Layout A: lowest bit should be 1
| bit0 = 1 | bit1 | bit2 | ... | bit MAX | private data .... |
bit 0	PAGE_PRIVATE_NOT_POINTER
bit 1	PAGE_PRIVATE_ATOMIC_WRITE
bit 2	PAGE_PRIVATE_DUMMY_WRITE
bit 3	PAGE_PRIVATE_ONGOING_MIGRATION
bit 4	PAGE_PRIVATE_INLINE_INODE
bit 5	PAGE_PRIVATE_REF_RESOURCE
bit 6-	f2fs private data

Layout B: lowest bit should be 0
page.private is a wrapped pointer.

After the change:

page.private	PG_private	PG_checked	page._refcount (+1 at most)
a)	11	set				+1
b)	101	set				+1
c)	1001	set				+1
d)	10001	set				+1
e)			set
f)	100001	set				+1
g)	pointer	set				+1

[1] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org/T/#u

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
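A minimal C sketch of how Layout A's tag bit can be tested (the PAGE_PRIVATE_* names come from the patch; the helper below is an illustrative assumption, not the exact in-tree code):

	enum {
		PAGE_PRIVATE_NOT_POINTER,	/* bit 0: private is not a pointer */
		PAGE_PRIVATE_ATOMIC_WRITE,	/* bit 1: page is in atomic-write list */
		PAGE_PRIVATE_DUMMY_WRITE,	/* bit 2: dummy data for aligned write */
		PAGE_PRIVATE_ONGOING_MIGRATION,	/* bit 3: page is migrating for GC */
		PAGE_PRIVATE_INLINE_INODE,	/* bit 4: page holds inline data */
		PAGE_PRIVATE_REF_RESOURCE,	/* bit 5: private holds a refcounted resource */
		PAGE_PRIVATE_MAX		/* bits 6+: raw f2fs private data */
	};

	static inline bool page_private_not_pointer(struct page *page)
	{
		/* Layout B stores a pointer, whose low bit is always 0, so a
		 * set bit 0 unambiguously marks Layout A status bits */
		return PagePrivate(page) &&
			test_bit(PAGE_PRIVATE_NOT_POINTER, &page_private(page));
	}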
-
- 11 May, 2021 1 commit
-
-
Jaegeuk Kim authored
Unable to handle kernel NULL pointer dereference at virtual address 000000000000001a
pc : f2fs_inplace_write_data+0x144/0x208
lr : f2fs_inplace_write_data+0x134/0x208
Call trace:
 f2fs_inplace_write_data+0x144/0x208
 f2fs_do_write_data_page+0x270/0x770
 f2fs_write_single_data_page+0x47c/0x830
 __f2fs_write_data_pages+0x444/0x98c
 f2fs_write_data_pages.llvm.16514453770497736882+0x2c/0x38
 do_writepages+0x58/0x118
 __writeback_single_inode+0x44/0x300
 writeback_sb_inodes+0x4b8/0x9c8
 wb_writeback+0x148/0x42c
 wb_do_writeback+0xc8/0x390
 wb_workfn+0xb0/0x2f4
 process_one_work+0x1fc/0x444
 worker_thread+0x268/0x4b4
 kthread+0x13c/0x158
 ret_from_fork+0x10/0x18
Fixes: 95577278 ("f2fs: drop inplace IO if fs status is abnormal")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
- 26 Apr, 2021 1 commit
-
-
Chao Yu authored
If the filesystem has cp_error or need_fsck status, let's drop inplace IO to avoid further corruption of fs data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
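A hedged sketch of the guard this describes (is_sbi_flag_set(), SBI_NEED_FSCK and f2fs_cp_error() are real f2fs helpers; the placement and the return value are assumptions):

	/* in the inplace-update path: refuse IPU while the fs state is
	 * suspect, so a rewrite cannot spread the corruption */
	if (is_sbi_flag_set(sbi, SBI_NEED_FSCK) || f2fs_cp_error(sbi))
		return -EIO;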
-
- 22 Apr, 2021 1 commit
-
-
Chao Yu authored
As we did for other cases, in fix_curseg_write_pointer(), let's use the wrapped f2fs_allocate_new_section() instead of the native allocate_segment_by_default(); this way, the fix covers segment allocation with curseg_lock and sentry_lock.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
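A hedged sketch of the locking the wrapper provides (curseg_lock and sentry_lock are real f2fs locks; the body is an assumption about the shape of f2fs_allocate_new_section(), not a copy of it):

	down_read(&SM_I(sbi)->curseg_lock);
	down_write(&SIT_I(sbi)->sentry_lock);
	/* the raw allocator is only ever invoked under both locks */
	allocate_segment_by_default(sbi, type, true);
	up_write(&SIT_I(sbi)->sentry_lock);
	up_read(&SM_I(sbi)->curseg_lock);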
-
- 13 Apr, 2021 2 commits
-
-
Yi Chen authored
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : f2fs_put_page+0x1c/0x26c
lr : __revoke_inmem_pages+0x544/0x75c
 f2fs_put_page+0x1c/0x26c
 __revoke_inmem_pages+0x544/0x75c
 __f2fs_commit_inmem_pages+0x364/0x3c0
 f2fs_commit_inmem_pages+0xc8/0x1a0
 f2fs_ioc_commit_atomic_write+0xa4/0x15c
 f2fs_ioctl+0x5b0/0x1574
 file_ioctl+0x154/0x320
 do_vfs_ioctl+0x164/0x740
 __arm64_sys_ioctl+0x78/0xa4
 el0_svc_common+0xbc/0x1d0
 el0_svc_handler+0x74/0x98
 el0_svc+0x8/0xc

In f2fs_put_page, we access page->mapping, which is NULL.

The root cause is: in some cases, the page refcount and the ATOMIC_WRITTEN_PAGE flag are missed while the page-private flag has been set. We added an f2fs_bug_on like this:

f2fs_register_inmem_page()
{
	...
	f2fs_set_page_private(page, ATOMIC_WRITTEN_PAGE);
	f2fs_bug_on(F2FS_I_SB(inode), !IS_ATOMIC_WRITTEN_PAGE(page));
	...
}

The bug-on stack looks like this:

PC is at f2fs_register_inmem_page+0x238/0x2b4
LR is at f2fs_register_inmem_page+0x2a8/0x2b4
 f2fs_register_inmem_page+0x238/0x2b4
 f2fs_set_data_page_dirty+0x104/0x164
 set_page_dirty+0x78/0xc8
 f2fs_write_end+0x1b4/0x444
 generic_perform_write+0x144/0x1cc
 __generic_file_write_iter+0xc4/0x174
 f2fs_file_write_iter+0x2c0/0x350
 __vfs_write+0x104/0x134
 vfs_write+0xe8/0x19c
 SyS_pwrite64+0x78/0xb8

To fix this issue, let's take a page refcount when adding the page-private flag. Why the page-private flag is not cleared still needs further analysis.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Signed-off-by: Dehe Gu <gudehe@huawei.com>
Signed-off-by: Yi Chen <chenyi77@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Chao Yu authored
f2fs_segment_has_free_slot() was copied and modified from __next_free_blkoff(); they are almost the same, so clean up to reuse common code as much as possible.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
- 10 Apr, 2021 1 commit
-
-
Yi Zhuang authored
This patch combined the below three clean-up patches.
- modify open brace '{' following function definitions
- ERROR: spaces required around that ':'
- ERROR: spaces required before the open parenthesis '('
- ERROR: spaces prohibited before that ','
- Made suggested modifications from checkpatch in reference to WARNING: Missing a blank line after declarations
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
- 06 Apr, 2021 2 commits
-
-
Sahitya Tummala authored
Fix the unnecessary periodic wakeups of the discard thread that happen under the two conditions below:
1. When f2fs is heavily utilized, over 80%, the current discard policy sets the max sleep timeout of the discard thread to 50ms (DEF_MIN_DISCARD_ISSUE_TIME). But this is set even when there are no pending discard commands to be issued.
2. In the issue_discard_thread() path, when there are no pending discard commands, it fails to reset wait_ms to the max timeout value.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
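A hedged sketch of the second case (issue_discard_thread() and wait_ms are from the message; dcc->discard_cmd_cnt and dpolicy.max_interval are real f2fs fields, but the exact control flow is an assumption):

	/* in the discard thread loop: with nothing queued, sleep for the
	 * max interval instead of the 50ms minimum */
	if (!atomic_read(&dcc->discard_cmd_cnt)) {
		wait_ms = dpolicy.max_interval;
		continue;
	}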
-
Chao Yu authored
Callers may pass a fio parameter with a NULL value to f2fs_allocate_data_block(), so we should make sure to access fio's fields only after fio's validation check.
Fixes: f608c38c ("f2fs: clean up parameter of f2fs_allocate_data_block()")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
- 31 Mar, 2021 2 commits
-
-
Yi Zhuang authored
In the cache writing process, if it is an atomic file, increase the page count of F2FS_WB_CP_DATA; otherwise increase the page count of F2FS_WB_DATA. When we step into the hook branch due to insufficient memory in f2fs_write_begin, f2fs_drop_inmem_pages_all will be called to traverse all atomic inodes and clear the FI_ATOMIC_FILE mark of all atomic files. In f2fs_drop_inmem_pages, we first acquire the inmem_lock, revoke all the inmem_pages, and then clear the FI_ATOMIC_FILE mark. Before this mark is cleared, other threads may hold the inmem_lock to add inmem_pages to the inode whose inmem_pages have just been emptied, and increase the page count of F2FS_WB_CP_DATA. When the IO returns, it finds that the FI_ATOMIC_FILE flag has been cleared by f2fs_drop_inmem_pages_all, and f2fs_is_atomic_file returns false, which causes the page count of F2FS_WB_DATA to be decremented instead. The page count of F2FS_WB_CP_DATA can then never be cleared, and finally a hung task is triggered in f2fs_wait_on_all_pages because get_pages will never return zero.

process A:				process B:
f2fs_drop_inmem_pages_all
->f2fs_drop_inmem_pages of inode#1
->mutex_lock(&fi->inmem_lock)
->__revoke_inmem_pages of inode#1
					f2fs_ioc_commit_atomic_write
->mutex_unlock(&fi->inmem_lock)
					->f2fs_commit_inmem_pages of inode#1
					->mutex_lock(&fi->inmem_lock)
					->__f2fs_commit_inmem_pages
					->f2fs_do_write_data_page
					->f2fs_outplace_write_data
					->do_write_page
					->f2fs_submit_page_write
					->inc_page_count(sbi, F2FS_WB_CP_DATA)
					->mutex_unlock(&fi->inmem_lock)
->spin_lock(&sbi->inode_lock[ATOMIC_FILE])
->clear_inode_flag(inode, FI_ATOMIC_FILE)
->spin_unlock(&sbi->inode_lock[ATOMIC_FILE])
					f2fs_write_end_io
					->dec_page_count(sbi, F2FS_WB_DATA)

We can fix the problem by moving the clearing of the FI_ATOMIC_FILE mark under the inmem_lock. This ensures that no one can submit inmem pages before the FI_ATOMIC_FILE mark is cleared, so that there will be no atomic writes waiting for writeback.
Fixes: 57864ae5 ("f2fs: limit # of inmemory pages")
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
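A hedged sketch of the fix (the lock and flag names are from the message; the __revoke_inmem_pages() arguments and the surrounding body are assumptions):

	mutex_lock(&fi->inmem_lock);
	__revoke_inmem_pages(inode, &fi->inmem_pages, true, false, true);
	/* clear the flag while inmem_lock is still held, so no other
	 * thread can register or commit inmem pages in the window */
	spin_lock(&sbi->inode_lock[ATOMIC_FILE]);
	clear_inode_flag(inode, FI_ATOMIC_FILE);
	spin_unlock(&sbi->inode_lock[ATOMIC_FILE]);
	mutex_unlock(&fi->inmem_lock);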
-
Chao Yu authored
In this patch, we will add two new mount options: "gc_merge" and "nogc_merge". When background_gc is on, the "gc_merge" option can be set to let the background GC thread handle foreground GC requests; this can eliminate the sluggish issue caused by slow foreground GC operation when GC is triggered from a process with limited I/O and CPU resources. The original idea is from Xiang.
Signed-off-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
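A hedged sketch of what a foreground caller might do when gc_merge is set (the option is from the patch; the wait-queue field names and the wake/wait pairing are assumptions, not the exact in-tree code):

	if (test_opt(sbi, GC_MERGE) && sbi->gc_thread) {
		/* hand the work to the background GC thread and wait,
		 * instead of running foreground GC in this context */
		wake_up(&sbi->gc_thread->gc_wait_queue_head);
		wait_event(sbi->gc_thread->fggc_wq,
			   !has_not_enough_free_secs(sbi, 0, 0));
	}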
-
- 26 Mar, 2021 5 commits
-
-
Chao Yu authored
In order to avoid a race with f2fs_do_replace_block().
Fixes: f5a53edc ("f2fs: support aligned pinned file")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Wang Xiaojun authored
If the alloc_type of the original curseg is LFS, when we do change_curseg and then recover the curseg, the alloc_type becomes SSR.
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Sahitya Tummala authored
With the default DPOLICY_BG, the discard thread is io-aware, which prevents the discard thread from issuing the discard commands. On low-RAM setups, it is observed that these discard commands in the cache are consuming high memory. This patch aims to relax the memory pressure on the system due to f2fs pending discard cmds by changing the policy to DPOLICY_FORCE based on the configured nm_i->ram_thresh.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
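A hedged sketch of the policy switch (DPOLICY_BG/DPOLICY_FORCE and ram_thresh are from the message; f2fs_available_free_memory() and __init_discard_policy() are real f2fs helpers, while this call site and the DISCARD_CACHE tag are assumptions):

	int discard_type = DPOLICY_BG;

	/* under memory pressure, stop being io-aware so the cached
	 * discard commands actually drain */
	if (!f2fs_available_free_memory(sbi, DISCARD_CACHE))
		discard_type = DPOLICY_FORCE;
	__init_discard_policy(sbi, &dpolicy, discard_type, 0);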
-
Chao Yu authored
In CP disabling mode, there are two issues when using LFS or SSR | AT_SSR mode to select a victim:

1. LFS is set to find a source section during GC; the victim should have no checkpointed data, since after GC the section could not be set free for reuse. Previously, we only checked valid ckpt blocks in the current segment rather than the section; fix it.

2. SSR | AT_SSR are set to find a target segment for writes, which can be fully filled by checkpointed and newly written blocks; we should never select such a segment, otherwise it can cause a panic or data corruption during allocation. A potential case is described below:
a) target segment has 'n' (n < 512) ckpt valid blocks
b) GC migrates 'n' valid blocks to another segment (the segment is still in the dirty list)
c) GC migrates '512 - n' blocks to the target segment (the segment has 'n' cp_vblocks and '512 - n' vblocks)
d) if GC selects the target segment via the {AT,}SSR allocator, there is no free space in the target segment
Fixes: 4354994f ("f2fs: checkpoint disabling")
Fixes: 093749e2 ("f2fs: support age threshold based garbage collection")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
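A hedged sketch of the guard implied by issue 2 (get_ckpt_valid_blocks() and get_valid_blocks() are real f2fs helpers; the surrounding loop and the exact condition are assumptions):

	/* skip a candidate that checkpointed plus live blocks would fill
	 * completely: nothing in it is reclaimable before the next CP */
	if (unlikely(is_sbi_flag_set(sbi, SBI_CP_DISABLED) &&
		     get_ckpt_valid_blocks(sbi, segno) +
		     get_valid_blocks(sbi, segno, false) ==
		     sbi->blocks_per_seg))
		continue;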
-
Weichao Guo authored
AT_SSR mode is introduced by age-threshold-based GC for better hot/cold data separation and for avoiding free segment cost. However, LFS write mode is preferred in the scenario of foreground or high-urgent GC, which should be finished ASAP. Let's only use AT_SSR in background GC and in GC modes that are not high-urgent.
Signed-off-by: Weichao Guo <guoweichao@oppo.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
- 23 Mar, 2021 1 commit
-
-
Chao Yu authored
Now, fallocate() on a pinned file only allocates blocks which align to the segment rather than the section, so GC may try to migrate a pinned file's block, and after several failures the pinned file's block could be migrated to another place; however, the user won't be aware of such a condition, and then the old obsolete block address may be read/written incorrectly. To avoid such a condition, let's try to allocate a pinned file's blocks with section alignment.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
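A hedged sketch of section-aligned expansion (BLKS_PER_SEC() is a real f2fs macro; the map fields and the rounding site are assumptions):

	/* expand pinned files in whole-section units, so GC never needs
	 * to relocate a partially pinned section */
	map.m_len = roundup(map.m_len, BLKS_PER_SEC(sbi));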
-
- 12 Mar, 2021 3 commits
-
-
Chao Yu authored
In the trim thread, let's add a condition to check the discard command number before traversing the discard pending list; it can avoid unneeded traversal if there is no discard command.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Chao Yu authored
Add more detailed comments for the explicit memory barriers used by f2fs, in order to enhance code readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Chao Yu authored
F2FS_IOC_FLUSH_DEVICE/F2FS_IOC_RESIZE_FS needs to migrate all blocks of the target segment to another place, no matter whether the segment has partially or fully valid blocks. However, after commit 803e74be ("f2fs: stop GC when the victim becomes fully valid"), we may skip migration because the target segment is fully valid, resulting in failure of the ioctl interface; fix this.
Fixes: 803e74be ("f2fs: stop GC when the victim becomes fully valid")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
- 11 Mar, 2021 1 commit
-
-
Christoph Hellwig authored
Ever since the addition of multipage bio_vecs, BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 08 Feb, 2021 1 commit
-
-
Jaegeuk Kim authored
These are controlled by f2fs_freeze(). This fixes xfstests/generic/068, which is stuck at
task:f2fs_ckpt-252:3 state:D stack: 0 pid: 5761 ppid: 2 flags:0x00004000
Call Trace:
 __schedule+0x44c/0x8a0
 schedule+0x4f/0xc0
 percpu_rwsem_wait+0xd8/0x140
 ? percpu_down_write+0xf0/0xf0
 __percpu_down_read+0x56/0x70
 issue_checkpoint_thread+0x12c/0x160 [f2fs]
 ? wait_woken+0x80/0x80
 kthread+0x114/0x150
 ? __checkpoint_and_complete_reqs+0x110/0x110 [f2fs]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x22/0x30
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
- 27 Jan, 2021 2 commits
-
-
Jaegeuk Kim authored
This patch deprecates f2fs_trace_io, since f2fs uses page->private more broadly, resulting in more buggy cases.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Christoph Hellwig authored
Use the blkdev_issue_flush helper instead of duplicating it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 03 Dec, 2020 3 commits
-
-
Daeho Jeong authored
We will add a new "compress_mode" mount option to control the file compression mode. This supports "fs" and "user". In "fs" mode (the default), f2fs does automatic compression on the compression-enabled files. In "user" mode, f2fs disables the automatic compression and gives the user discretion in choosing the target file and the timing. It means the user can do manual compression/decompression on the compression-enabled files using ioctls.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
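A hedged sketch of the mode check this implies (compress_mode and the "fs"/"user" values are from the message; F2FS_OPTION() is a real f2fs accessor, while the enum name and the call site are assumptions):

	/* in the writeback path: compress automatically only in "fs"
	 * mode; "user" mode defers compression to explicit ioctls */
	if (F2FS_OPTION(sbi).compress_mode != COMPR_MODE_FS)
		return false;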
-
Jack Qiu authored
The section is dirty, but dirty_secmap may not be set.
Reported-by: Jia Yang <jiayang5@huawei.com>
Fixes: da52f8ad ("f2fs: get the right gc victim section when section has several segments")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Chao Yu authored
Lei Li reported an issue: if foreground operations are frequent, background checkpoint may always be skipped due to the check below, resulting in losing more data after a sudden power-cut:

f2fs_balance_fs_bg()
	...
	if (!is_idle(sbi, REQ_TIME) &&
	    (!excess_dirty_nats(sbi) && !excess_dirty_nodes(sbi)))
		return;

E.g:
cp_interval = 5 second
idle_interval = 2 second
foreground operation interval = 1 second (append 1 byte per second into file)

In such a case, no matter when it calls f2fs_balance_fs_bg(), is_idle(, REQ_TIME) returns false, resulting in skipping the background checkpoint.

This patch changes the trigger condition as below to make it more reasonable:
- trigger sync_fs() if dirty_{nats,nodes} and prefree segs exceed the threshold;
- skip triggering sync_fs() if there is any background inflight IO, or if there was a foreground operation recently and meanwhile cp_rwsem is being held by someone;
Reported-by: Lei Li <noctis.akm@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
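A hedged sketch of the reworked trigger (excess_dirty_nats/nodes() and excess_prefree_segs() are real f2fs predicates; the combined condition is an assumption):

	/* checkpoint once dirty metadata piles up, regardless of how
	 * busy the foreground is */
	if (excess_dirty_nats(sbi) || excess_dirty_nodes(sbi) ||
	    excess_prefree_segs(sbi))
		goto do_sync;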
-
- 14 Oct, 2020 2 commits
-
-
Chao Yu authored
This patch changes f2fs_flush_device_cache() to skip issuing a flush for the nobarrier case.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Jaegeuk Kim authored
The first problem is that we hit BUG_ON() in f2fs_get_sum_page given EIO on f2fs_get_meta_page_nofail(). The quick fix was not to give any error with an infinite loop, but syzbot caught a case where it goes into that loop from a fuzzed image. It turned out we abused f2fs_get_meta_page_nofail() as in the below call stack:
- f2fs_fill_super
 - f2fs_build_segment_manager
  - build_sit_entries
   - get_current_sit_page

INFO: task syz-executor178:6870 can't die for more than 143 seconds.
task:syz-executor178 state:R stack:26960 pid: 6870 ppid: 6869 flags:0x00004006
Call Trace:

Showing all locks held in the system:
1 lock held by khungtaskd/1179:
#0: ffffffff8a554da0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6242
1 lock held by systemd-journal/3920:
1 lock held by in:imklog/6769:
#0: ffff88809eebc130 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:930
1 lock held by syz-executor178/6870:
#0: ffff8880925120e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0x201/0xaf0 fs/super.c:229

Actually, we didn't have to use _nofail in this case, since we could return the error to mount(2) already with the error handler. As a result, this patch tries to 1) remove _nofail callers as much as possible, and 2) deal with the error case in the last remaining caller, f2fs_get_sum_page().
Reported-by: syzbot+ee250ac8137be41d7b13@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
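A hedged sketch of the error propagation in the remaining caller (f2fs_get_meta_page() and GET_SUM_BLOCK() are real f2fs symbols; the exact body shown is an assumption):

	struct page *f2fs_get_sum_page(struct f2fs_sb_info *sbi, unsigned int segno)
	{
		/* may return ERR_PTR(-EIO): callers now check IS_ERR()
		 * instead of relying on a never-fail loop */
		return f2fs_get_meta_page(sbi, GET_SUM_BLOCK(sbi, segno));
	}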
-
- 14 Sep, 2020 1 commit
-
-
Chao Yu authored
After commit 0b6d4ca0 ("f2fs: don't return vmalloc() memory from f2fs_kmalloc()"), f2fs_k{m,z}alloc() will not return vmalloc()'ed memory, so clean up to use kfree() instead of kvfree() to free the memory it allocated.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
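A hedged illustration (f2fs_kmalloc() is a real helper; the allocation site is hypothetical):

	void *buf = f2fs_kmalloc(sbi, size, GFP_KERNEL);

	if (!buf)
		return -ENOMEM;
	/* f2fs_kmalloc() no longer hands back vmalloc() memory, so the
	 * plain kfree() pair is correct; kvfree() is unnecessary */
	kfree(buf);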
-
- 11 Sep, 2020 1 commit
-
-
Chao Yu authored
There are several issues in the current background GC algorithm:
- valid blocks is one of the key factors during cost overhead calculation, so if a segment has fewer valid blocks, even if its age is young or it is located in a hot segment, the CB algorithm will still choose the segment as victim; this is not appropriate.
- GCed data/node blocks go to existing logs, no matter whether the data's update frequency is the same or not; this may mix hot and cold data again.
- the GC allocator mainly uses LFS-type segments, so it consumes free segments more quickly.

This patch introduces a new algorithm named age-threshold-based garbage collection to solve the above issues. There are three steps mainly:

1. select a source victim:
- set an age threshold, and select candidates based on the threshold: e.g. 0 means youngest, 100 means oldest; if we set the age threshold to 80, then select dirty segments which have age in the range of [80, 100] as candidates;
- set a candidate_ratio threshold, and select candidates based on the ratio, so that we can shrink candidates down to the oldest segments;
- select the target segment with the fewest valid blocks in order to migrate blocks with minimum cost;

2. select a target victim:
- select candidates based on the age threshold;
- set a candidate_radius threshold, and search candidates whose age is around the source victim's; the searching radius should be less than the radius threshold;
- select the target segment with the most valid blocks in order to avoid migrating the current target segment;

3. merge valid blocks from the source victim into the target victim with the SSR allocator.

Test steps:
- create 160 dirty segments:
 * half of them have 128 valid blocks per segment
 * the rest of them have 384 valid blocks per segment
- run background GC

Benefit: GC count and block movement count both decrease obviously:

- Before:
  - Valid: 86
  - Dirty: 1
  - Prefree: 11
  - Free: 6001 (6001)
  GC calls: 162 (BG: 220)
  - data segments : 160 (160)
  - node segments : 2 (2)
  Try to move 41454 blocks (BG: 41454)
  - data blocks : 40960 (40960)
  - node blocks : 494 (494)
  IPU: 0 blocks
  SSR: 0 blocks in 0 segments
  LFS: 41364 blocks in 81 segments

- After:
  - Valid: 87
  - Dirty: 0
  - Prefree: 4
  - Free: 6008 (6008)
  GC calls: 75 (BG: 76)
  - data segments : 74 (74)
  - node segments : 1 (1)
  Try to move 12813 blocks (BG: 12813)
  - data blocks : 12544 (12544)
  - node blocks : 269 (269)
  IPU: 0 blocks
  SSR: 12032 blocks in 77 segments
  LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
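A hedged sketch of the step 1 filter (the candidate terms are from the message; get_seg_age() is a hypothetical helper and the loop shape is an assumption):

	unsigned int age = get_seg_age(sbi, segno);	/* 0 young .. 100 old */

	if (age < age_threshold)
		continue;	/* step 1: too young to be a source victim */
	/* among the remaining candidates, prefer the fewest valid blocks */
	if (get_valid_blocks(sbi, segno, false) < min_vblocks) {
		min_vblocks = get_valid_blocks(sbi, segno, false);
		source_victim = segno;
	}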
-
- 10 Sep, 2020 6 commits
-
-
Chao Yu authored
Then we can add a specified entry into the rb-tree with the 64-bit segment time as the key.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Chao Yu authored
Don't let f2fs' inner GC ruin the original aging degree of a segment.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Chao Yu authored
Previously, once we updated one block in a segment, we updated the mtime of the segment to the last time, making an aged segment become the freshest and resulting in GC with the cost-benefit algorithm missing such a segment. So this patch changes to record mtime as the average block updating time instead of the last updating time. It's not needed to reset mtime for a prefree segment, as se->valid_blocks is zero, so the old se->mtime won't carry any weight in the calculation below:

	se->mtime = div_u64(se->mtime * se->valid_blocks + mtime,
			    se->valid_blocks + 1);
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Chao Yu authored
The previous implementation of aligned pinfile allocation will:
- allocate a new segment on the cold data log no matter whether the last used segment is partially used or not; it makes IOs more random;
- force concurrent cold data/GCed IO to go into the warm data area; it can have a bad effect on hot/cold data separation.

In this patch, we introduce a new type of log named 'inmem curseg'. The differences from a normal curseg are:
- it reuses existing segment types (CURSEG_XXX_NODE/DATA);
- it only exists in memory; its segno, blkofs, and summary will not be persisted into the checkpoint area.

With this new feature, we can enhance the scalability of logs, and special allocators can be created for purposes such as:
- a pure lfs allocator for aligned pinfile allocation or file defragmentation
- a pure ssr allocator for a later feature

So, let's update aligned pinfile allocation to use this new inmem curseg framework.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Xiaojun Wang authored
Since DUMMY_WRITTEN_PAGE and ATOMIC_WRITTEN_PAGE have already been converted to unsigned long type, we don't need to do the type casting again.
Signed-off-by: Xiaojun Wang <wangxiaojun11@huawei.com>
Reported-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-
Aravind Ramesh authored
NVMe Zoned Namespace devices can have a zone-capacity less than the zone-size. Zone-capacity indicates the maximum number of sectors that are usable in a zone beginning from the first sector of the zone. This makes the sectors after the zone-capacity till the zone-size unusable.

This patch set tracks zone-size and zone-capacity in zoned devices and calculates the usable blocks per segment and usable segments per section. If zone-capacity is less than zone-size, mark only those segments which start before zone-capacity as free segments. All segments at and beyond zone-capacity are treated as permanently used segments. In cases where zone-capacity does not align with the segment size, the last segment will start before zone-capacity and end beyond the zone-capacity of the zone. For such spanning segments, only sectors within the zone-capacity are used.

During writes and GC, manage the usable segments in a section and the usable blocks per segment. Segments which are beyond zone-capacity are never allocated and do not need to be garbage collected; only the segments which are before zone-capacity need to be garbage collected. For spanning segments, based on the number of usable blocks in that segment, write to blocks only up to zone-capacity.

Zone-capacity is device specific and cannot be configured by the user. Since NVMe ZNS device zones are sequential-write-only, a block device with conventional zones or any normal block device is needed along with the ZNS device for the metadata operations of f2fs.

A typical nvme-cli output of a zoned device shows zone start, capacity and write pointer as below:

SLBA: 0x0      WP: 0x0      Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x20000  WP: 0x20000  Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x40000  WP: 0x40000  Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ

Here the zone size is 64MB and the capacity is 49MB; WP is at the zone start as the zones are in the EMPTY state. For each zone, only zone start + 49MB is the usable area; any lba/sector after 49MB cannot be read or written to, and the drive will fail any attempts to read/write it. So, the second zone starts at 64MB and is usable till 113MB (64 + 49), and the range between 113 and 128MB is again unusable. The next zone starts at 128MB, and so on.
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
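A hedged sketch of the usable-blocks arithmetic for a spanning segment (the quantities are from the message; the function and its arguments are hypothetical):

	/* blocks of one segment that fall below the zone-capacity limit */
	static unsigned int usable_blks_in_seg(unsigned int seg_start_blk,
					       unsigned int blks_per_seg,
					       unsigned int zone_cap_blks)
	{
		unsigned int seg_end = seg_start_blk + blks_per_seg;

		if (seg_start_blk >= zone_cap_blks)
			return 0;			/* wholly beyond capacity */
		if (seg_end <= zone_cap_blks)
			return blks_per_seg;		/* fully usable */
		return zone_cap_blks - seg_start_blk;	/* spanning segment */
	}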
-
- 09 Sep, 2020 1 commit
-
-
Shin'ichiro Kawasaki authored
Commit da52f8ad ("f2fs: get the right gc victim section when section has several segments") added code to count the blocks of each section using variables of type 'unsigned short', which has a 2-byte size on many systems. However, the counts can be larger than the 2-byte range, and the type conversion results in wrong values. Especially when the f2fs sections have as many blocks as USHRT_MAX + 1, the count is handled as 0. This triggers an eternal loop in init_dirty_segmap() at mount system call time. Fix this by changing the type of the variables to block_t.
Fixes: da52f8ad ("f2fs: get the right gc victim section when section has several segments")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
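A hedged illustration of the wraparound (block_t and USHRT_MAX are real kernel types/limits; the variables are hypothetical):

	/* a section of 128 segments * 512 blocks = 65536 blocks: stored
	 * in an unsigned short this truncates to 0, so the dirty-segmap
	 * init loop never makes progress */
	unsigned short bad_count = 65536;	/* USHRT_MAX + 1 -> wraps to 0 */
	block_t good_count = 65536;		/* 32-bit, counts correctly */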
-