- 27 May, 2020 33 commits
-
-
Sagi Grimberg authored
When removing a namespace, we add an NS_CHANGE async event, however if the controller admin queue is removed after the event was added but not yet processed, we won't free the aens, resulting in the below memory leak [1]. Fix that by moving nvmet_async_event_free to the final controller release after it is detached from subsys->ctrls ensuring no async events are added, and modify it to simply remove all pending aens. -- $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888c1af2c000 (size 32): comm "nvmetcli", pid 5164, jiffies 4295220864 (age 6829.924s) hex dump (first 32 bytes): 28 01 82 3b 8b 88 ff ff 28 01 82 3b 8b 88 ff ff (..;....(..;.... 02 00 04 65 76 65 6e 74 5f 66 69 6c 65 00 00 00 ...event_file... backtrace: [<00000000217ae580>] nvmet_add_async_event+0x57/0x290 [nvmet] [<0000000012aa2ea9>] nvmet_ns_changed+0x206/0x300 [nvmet] [<00000000bb3fd52e>] nvmet_ns_disable+0x367/0x4f0 [nvmet] [<00000000e91ca9ec>] nvmet_ns_free+0x15/0x180 [nvmet] [<00000000a15deb52>] config_item_release+0xf1/0x1c0 [<000000007e148432>] configfs_rmdir+0x555/0x7c0 [<00000000f4506ea6>] vfs_rmdir+0x142/0x3c0 [<0000000000acaaf0>] do_rmdir+0x2b2/0x340 [<0000000034d1aa52>] do_syscall_64+0xa5/0x4d0 [<00000000211f13bc>] entry_SYSCALL_64_after_hwframe+0x6a/0xdf Fixes: a07b4970 ("nvmet: add a generic NVMe target") Reported-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end protection information passthrough and validation for NVMe over RDMA transport. Metadata support was implemented over the new RDMA signature verbs API. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
Allocate the metadata SGL buffers and set metadata fields for the request. Then create a block IO request for the metadata from the protection SG list. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
Expose the namespace metadata format when PI is enabled. The user needs to enable the capability per subsystem and per port. The other metadata properties are taken from the namespace/bdev. Usage example: echo 1 > /config/nvmet/subsystems/${NAME}/attr_pi_enable echo 1 > /config/nvmet/ports/${PORT_NUM}/param_pi_enable Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
The enumerations will be used to expose the namespace metadata format by the target. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
The function doesn't check only the data length, because the transfer length includes also the metadata length in some cases. This is preparation for adding metadata (T10-PI) support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
The function doesn't add the metadata length (only data length is calculated). This is preparation for adding metadata (T10-PI) support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
Fill those namespace fields from the block device format for adding metadata (T10-PI) over fabric support with block devices. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end protection information passthrough and validation for NVMe over RDMA transport. Metadata offload support was implemented over the new RDMA signature verbs API and it is enabled for capable controllers. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
Remove first_sgl pointer from struct nvme_rdma_request and use pointer arithmetic instead. The inline scatterlist, if exists, will be located right after the nvme_rdma_request. This patch is needed as a preparation for adding PI support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Israel Rukshin authored
SGL size of metadata is usually small. Thus, 1 inline sg should cover most cases. The macro will be used for pre-allocate a single SGL entry for metadata. The preallocation of small inline SGLs depends on SG_CHAIN capability so if the ARCH doesn't support SG_CHAIN, use the runtime allocation for the SGL. This patch is a preparation for adding metadata (T10-PI) over fabric support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
An extended LBA is a larger LBA that is created when metadata associated with the LBA is transferred contiguously with the LBA data (AKA interleaved). The metadata may be either transferred as part of the LBA (creating an extended LBA) or it may be transferred as a separate contiguous buffer of data. According to the NVMeoF spec, a fabrics ctrl supports only an Extended LBA format. Fail revalidation in case we have a spec violation. Also add a flag that will imply on capable transports and controllers as part of a preparation for allowing end-to-end protection information for fabric controllers. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
This patch doesn't change any logic, and is needed as a preparation for adding PI support for fabrics drivers that will use an extended LBA format for metadata and will support more than 1 integrity segment. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
James Smart authored
Move the nvme_ns_has_pi() inline from core.c to the nvme.h header. This allows use by the transports. Signed-off-by: James Smart <jsmart2021@gmail.com> [maxg: added a comment for nvme_ns_has_pi()] Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
This is a preparation for adding support for metadata in fabric controllers. New flag will imply that NVMe namespace supports getting metadata that was originally generated by host's block layer. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
Replace the specific ext boolean (that implies on extended LBA format) with a feature in the new namespace features flag. This is a preparation for adding more namespace features (such as metadata specific features). Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
This will reduce the amount of ifdefs inside the source code for various drivers and also will reduce the amount of stub functions that were created for the !CONFIG_BLK_DEV_INTEGRITY case. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
Add a new attribute "revalidate_size" for the namespace which allows user to revalidate and generate the AEN if needed. This attribute is needed so that we can install userspace rules with systemd service based on inotify/fsnotify/uevent. The registered callback for such a service will end up writing to this attribute to generate AEN if needed. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimbeg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
The newly added function nvmet_ns_revalidate() does update the ns size in the identify namespace in-core target data structure when host issues id-ns command. This can lead to host having inconsistencies between size of the namespace present in the id-ns command result and size of the corresponding block device until host scans the namespaces explicitly. To avoid this scenario generate AEN if old size is not same as the new one in nvmet_ns_revalidate(). This will allow automatic AEN generation when host calls id-ns command and also allows target to install userspace rules so that it can trigger nvmet_ns_revalidate() (using configfs interface with the help of next patch) resulting in appropriate AEN generation when underlying namespace size change is detected. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimbeg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
This patch adds a wrapper helper to indicate size change in the bdev & file-backed namespace when revalidating ns. This helper is needed in order to minimize code repetition in the next patch for configfs.c and existing admin-cmd.c. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimbeg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Chaitanya Kulkarni authored
This adds a new tracepoint for the target to trace async event. This is helpful in debugging and comparing host and target side async events especially when host is connected to different targets on different machines and now that we rely on userspace components to generate AEN. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Dan Carpenter authored
The nvme_put_ctrl() is implemented earlier as an inline function so this declaration isn't required. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Gustavo A. R. Silva authored
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Damien Le Moal authored
Currently, a namespace io_opt queue limit is set by default to the physical sector size of the namespace and to the the write optimal size (NOWS) when the namespace reports optimal IO sizes. This causes problems with block limits stacking in blk_stack_limits() when a namespace block device is combined with an HDD which generally do not report any optimal transfer size (io_opt limit is 0). The code: /* Optimal I/O a multiple of the physical block size? */ if (t->io_opt & (t->physical_block_size - 1)) { t->io_opt = 0; t->misaligned = 1; ret = -1; } in blk_stack_limits() results in an error return for this function when the combined devices have different but compatible physical sector sizes (e.g. 512B sector SSD with 4KB sector disks). Fix this by not setting the optimal IO size queue limit if the namespace does not report an optimal write size value. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bart van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Wu Bo authored
Disable streams again if getting the stream params fails. Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Martin George authored
The nvme-fc devloss_tmo is computed as the min of either the ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo value. So correct this by printing the min value instead. Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Weiping Zhang authored
Check module parameter write/poll_queues before using it to catch too large values. Reproducer: modprobe -r nvme modprobe nvme write_queues=`nproc` echo $((`nproc`+1)) > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller [ 657.069000] ------------[ cut here ]------------ [ 657.069022] WARNING: CPU: 10 PID: 1163 at kernel/irq/affinity.c:390 irq_create_affinity_masks+0x47c/0x4a0 [ 657.069056] dm_region_hash dm_log dm_mod [ 657.069059] CPU: 10 PID: 1163 Comm: kworker/u193:9 Kdump: loaded Tainted: G W 5.6.0+ #8 [ 657.069060] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 657.069064] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 657.069066] RIP: 0010:irq_create_affinity_masks+0x47c/0x4a0 [ 657.069067] Code: fe ff ff 48 c7 c0 b0 89 14 95 48 89 46 20 e9 e9 fb ff ff 31 c0 e9 90 fc ff ff 0f 0b 48 c7 44 24 08 00 00 00 00 e9 e9 fc ff ff <0f> 0b e9 87 fe ff ff 48 8b 7c 24 28 e8 33 a0 80 00 e9 b6 fc ff ff [ 657.069068] RSP: 0018:ffffb505ce1ffc78 EFLAGS: 00010202 [ 657.069069] RAX: 0000000000000060 RBX: ffff9b97921fe5c0 RCX: 0000000000000000 [ 657.069069] RDX: ffff9b67bad80000 RSI: 00000000ffffffa0 RDI: 0000000000000000 [ 657.069070] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9b97921fe718 [ 657.069070] R10: ffff9b97921fe710 R11: 0000000000000001 R12: 0000000000000064 [ 657.069070] R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000001 [ 657.069071] FS: 0000000000000000(0000) GS:ffff9b67c0880000(0000) knlGS:0000000000000000 [ 657.069072] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 657.069072] CR2: 0000559eac6fc238 CR3: 000000057860a002 CR4: 00000000007606e0 [ 657.069073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 657.069073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 657.069073] PKRU: 55555554 [ 657.069074] Call Trace: [ 657.069080] __pci_enable_msix_range+0x233/0x5a0 [ 657.069085] ? kernfs_put+0xec/0x190 [ 657.069086] pci_alloc_irq_vectors_affinity+0xbb/0x130 [ 657.069089] nvme_reset_work+0x6e6/0xeab [nvme] [ 657.069093] ? __switch_to_asm+0x34/0x70 [ 657.069094] ? __switch_to_asm+0x40/0x70 [ 657.069095] ? nvme_irq_check+0x30/0x30 [nvme] [ 657.069098] process_one_work+0x1a7/0x370 [ 657.069101] worker_thread+0x1c9/0x380 [ 657.069102] ? max_active_store+0x80/0x80 [ 657.069103] kthread+0x112/0x130 [ 657.069104] ? __kthread_parkme+0x70/0x70 [ 657.069105] ret_from_fork+0x35/0x40 [ 657.069106] ---[ end trace f4f06b7d24513d06 ]--- [ 657.077110] nvme nvme0: 95/1/0 default/read/poll queues Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
Have routines handle errors and just bail out of the poll loop. This simplifies the code and will help as we may enhance the poll loop logic and these are somewhat in the way. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
when trying to send the pdu data digest, we should set this flag. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
We can signal the stack that this is not the last page coming and the stack can build a larger tso segment, so go ahead and use it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
We can signal the stack that this is not the last page coming and the stack can build a larger tso segment, so go ahead and use it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Chen Zhou authored
It is more efficient to use kmemdup_nul() if the size is known exactly. The doc in kernel: "Note: Use kmemdup_nul() instead if the size is known exactly." Signed-off-by: Chen Zhou <chenzhou10@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 26 May, 2020 1 commit
-
-
Jiri Kosina authored
Since the switch of floppy driver to blk-mq, the contended (fdc_busy) case in floppy_queue_rq() is not handled correctly. In case we reach floppy_queue_rq() with fdc_busy set (i.e. with the floppy locked due to another request still being in-flight), we put the request on the list of requests and return BLK_STS_OK to the block core, without actually scheduling delayed work / doing further processing of the request. This means that processing of this request is postponed until another request comes and passess uncontended. Which in some cases might actually never happen and we keep waiting indefinitely. The simple testcase is for i in `seq 1 2000`; do echo -en $i '\r'; blkid --info /dev/fd0 2> /dev/null; done run in quemu. That reliably causes blkid eventually indefinitely hanging in __floppy_read_block_0() waiting for completion, as the BIO callback never happens, and no further IO is ever submitted on the (non-existent) floppy device. This was observed reliably on qemu-emulated device. Fix that by not queuing the request in the contended case, and return BLK_STS_RESOURCE instead, so that blk core handles the request rescheduling and let it pass properly non-contended later. Fixes: a9f38e1d ("floppy: convert to blk-mq") Cc: stable@vger.kernel.org Tested-by: Libor Pechacek <lpechacek@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 May, 2020 1 commit
-
-
Colin Ian King authored
The variable error is being assigned a value that is never read so the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 May, 2020 5 commits
-
-
Christoph Hellwig authored
No callers left. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Stefan Haberland authored
The IBM partition parser requires device type specific information only available to the DASD driver to correctly register partitions. The current approach of using ioctl_by_bdev with a fake user space pointer is discouraged. Fix this by replacing IOCTL calls with direct in-kernel function calls. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Prepare for in-kernel callers of this functionality. Signed-off-by: Christoph Hellwig <hch@lst.de> [sth@de.ibm.com: remove leftover kfree] Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Martijn Coenen authored
This allows userspace to completely setup a loop device with a single ioctl, removing the in-between state where the device can be partially configured - eg the loop device has a backing file associated with it, but is reading from the wrong offset. Besides removing the intermediate state, another big benefit of this ioctl is that LOOP_SET_STATUS can be slow; the main reason for this slowness is that LOOP_SET_STATUS(64) calls blk_mq_freeze_queue() to freeze the associated queue; this requires waiting for RCU synchronization, which I've measured can take about 15-20ms on this device on average. In addition to doing what LOOP_SET_STATUS can do, LOOP_CONFIGURE can also be used to: - Set the correct block size immediately by setting loop_config.block_size (avoids LOOP_SET_BLOCK_SIZE) - Explicitly request direct I/O mode by setting LO_FLAGS_DIRECT_IO in loop_config.info.lo_flags (avoids LOOP_SET_DIRECT_IO) - Explicitly request read-only mode by setting LO_FLAGS_READ_ONLY in loop_config.info.lo_flags Here's setting up ~70 regular loop devices with an offset on an x86 Android device, using LOOP_SET_FD and LOOP_SET_STATUS: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m03.40s real 0m00.02s user 0m00.03s system Here's configuring ~70 devices in the same way, but using a modified losetup that uses the new LOOP_CONFIGURE ioctl: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m01.94s real 0m00.01s user 0m00.01s system Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Martijn Coenen authored
LOOP_SET_STATUS(64) will actually allow some lo_flags to be modified; in particular, LO_FLAGS_AUTOCLEAR can be set and cleared, whereas LO_FLAGS_PARTSCAN can be set to request a partition scan. Make this explicit by updating the UAPI to include the flags that can be set/cleared using this ioctl. The implementation can then blindly take over the passed in flags, and use the previous flags for those flags that can't be set / cleared using LOOP_SET_STATUS. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-