Commits · d5c0ed17fea60cca9bc3bf1278b49ba79242bbcd · Kirill Smelkov / linux

19 Mar, 2024 31 commits

virtio: packed: fix unmap leak for indirect desc table · d5c0ed17

Xuan Zhuo authored Feb 23, 2024

When use_dma_api and premapped are both true, do_unmap is false.

Because do_unmap is false, detach_buf_packed does not call
vring_unmap_extra_packed.

  if (unlikely(vq->do_unmap)) {
          curr = id;
          for (i = 0; i < state->num; i++) {
                  vring_unmap_extra_packed(vq,
                                           &vq->packed.desc_extra[curr]);
                  curr = vq->packed.desc_extra[curr].next;
          }
  }

So the indirect desc table is not unmapped, which leaks the mapping.

To fix this, check vq->use_dma_api instead. Correspondingly, the DMA info
is updated based on the use_dma_api check.

This bug does not occur in practice, because no driver uses premapped
mode together with indirect descriptors.
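
The interaction between the flags can be modelled in a few lines of plain C. This is a minimal user-space sketch, not the kernel code: `struct vq_model`, `do_unmap()`, `indirect_unmaps_old()` and `indirect_unmaps_new()` are hypothetical names, and the derivation of do_unmap from the other two flags is an assumption based only on the first sentence of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the flags named in the commit message. */
struct vq_model {
    bool use_dma_api;
    bool premapped;
};

/* Assumption: do_unmap means "DMA API in use and buffers not premapped". */
bool do_unmap(const struct vq_model *vq)
{
    assert(vq != NULL);
    return vq->use_dma_api && !vq->premapped;
}

/* Old check: the indirect table is unmapped only when do_unmap is true. */
int indirect_unmaps_old(const struct vq_model *vq)
{
    return do_unmap(vq) ? 1 : 0;
}

/* New check: use_dma_api alone decides whether the table is unmapped. */
int indirect_unmaps_new(const struct vq_model *vq)
{
    assert(vq != NULL);
    return vq->use_dma_api ? 1 : 0;
}
```

With both flags set, the old check performs zero unmaps of the indirect table while the new check performs one, which is the leak the message describes.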

Fixes: b319940f ("virtio_ring: skip unmap for premapped")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Message-Id: <20240223071833.26095-1-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-blk flush info to user space · 1ac61ddf

Zhu Lingshan authored Feb 19, 2024

This commit reports to user space whether a virtio-blk device
supports the cache flush command.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-11-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block read-only info to user space · ae1374b7

Zhu Lingshan authored Feb 19, 2024

This commit reports read-only information of
virtio-blk devices to user space.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-10-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block write zeroes configuration to user space · 6bdc7846

Zhu Lingshan authored Feb 19, 2024

This commit reports the write zeroes configuration of
virtio-block devices to user space, including:
1) maximum write zeroes sectors size
2) maximum write zeroes segment number
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-9-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block discarding configuration to user space · 65848f46

Zhu Lingshan authored Feb 19, 2024

This commit reports the virtio-blk discard configuration
to user space, including:
1) the maximum discard sectors
2) maximum number of discard segments for the block driver to use
3) the alignment for splitting a discarding request
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-8-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block topology info to user space · c9d989b4

Zhu Lingshan authored Feb 19, 2024

This commit allows vDPA to report topology information of
virtio-blk devices to user space, including:
1) the number of logical blocks per physical block
2) offset of first aligned logical block
3) suggested minimum I/O size in blocks
4) optimal (suggested maximum) I/O size in blocks
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-7-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block MQ info to user space · 54fb04b0

Zhu Lingshan authored Feb 19, 2024

This commit allows vDPA to report the virtio-block multi-queue
configuration to user space.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-6-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block max segments in a request to user space · 81f64e1d

Zhu Lingshan authored Feb 19, 2024

This commit allows vDPA to report the maximum number of
segments in a request of virtio-block devices to
user space.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-5-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block block-size to user space · 3a1d33fb

Zhu Lingshan authored Feb 19, 2024

This commit allows reporting the block size of a
virtio-block device to user space.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-4-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block max segment size to user space · 330b8aea

Zhu Lingshan authored Feb 19, 2024

This commit allows reporting the max size of any
single segment of virtio-block devices to user space.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-3-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: report virtio-block capacity to user space · c2475a9a

Zhu Lingshan authored Feb 19, 2024

This commit allows userspace to query the capacity of
a virtio-block device.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240218185606.13509-2-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio: make virtio_bus const · 2b666ee2

Ricardo B. Marliere authored Feb 04, 2024

Now that the driver core can properly handle constant struct bus_type,
move the virtio_bus variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Message-Id: <20240204-bus_cleanup-virtio-v1-1-3bcb2212aaa0@marliere.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jason Wang <jasowang@redhat.com>

vdpa: make vdpa_bus const · 8169ed62

Ricardo B. Marliere authored Feb 04, 2024

Now that the driver core can properly handle constant struct bus_type,
move the vdpa_bus variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Message-Id: <20240204-bus_cleanup-vdpa-v1-1-1745eccb0a5c@marliere.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

vDPA/ifcvf: implement vdpa_config_ops.get_vq_num_min · 56d61ae5

Zhu Lingshan authored Feb 03, 2024

IFCVF HW supports operation with a vq size smaller than the max size,
as the spec requires.

This commit implements vdpa_config_ops.get_vq_num_min to report
the minimal size of the virtqueues, which gives the vDPA framework
a chance to reduce the vring size.

We need at least one descriptor to be functional, but no fewer than
64 is preferable to meet certain performance requirements.
In practice the framework allocates at least a PAGE for the vq.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-11-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA/ifcvf: get_max_vq_size to return max size · cd214706

Zhu Lingshan authored Feb 03, 2024

Since vdpa_config_ops.get_vq_size is already implemented,
get_max_vq_size can return the actual max size of the
virtqueues rather than the max allowed safe size.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-10-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio_vdpa: create vqs with the actual size · 273ae08f

Zhu Lingshan authored Feb 03, 2024

The size of a virtqueue is a per-vq configuration.
This commit allows virtio_vdpa to create each
virtqueue with the actual size that the backend
device supports for that specific vq.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-9-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vduse: implement vdpa_config_ops.get_vq_size for vduse · 47e62e6d

Zhu Lingshan authored Feb 03, 2024

This commit implements get_vq_size for vdpa_config_ops. This
new interface is used to report per vq size.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-8-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa_sim: implement vdpa_config_ops.get_vq_size for vDPA simulator · f6fa2f7e

Zhu Lingshan authored Feb 03, 2024

This commit implements vdpa_config_ops.get_vq_size for the vDPA
simulator. This new interface reports the per-vq size.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-7-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

eni_vdpa: implement vdpa_config_ops.get_vq_size · 1da13e64

Zhu Lingshan authored Feb 03, 2024

This commit implements get_vq_size in vdpa_config_ops,
which reports the per-vq size.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-6-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vp_vdpa: implement vdpa_config_ops.get_vq_size · a97f9c8f

Zhu Lingshan authored Feb 03, 2024

This commit implements vdpa_config_ops.get_vq_size in
vp_vdpa, which reports per virtqueue size.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-5-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA/ifcvf: implement vdpa_config_ops.get_vq_size · 36503e5e

Zhu Lingshan authored Feb 03, 2024

This commit implements vdpa_config_ops.get_vq_size to report
the size of a specific virtqueue.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-4-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vDPA: introduce get_vq_size to vdpa_config_ops · 0a926fc9

Zhu Lingshan authored Feb 03, 2024

This commit introduces a new interface, get_vq_size, to the
vDPA config ops. The new interface is intended to report
the size of a specific virtqueue.
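
A per-virtqueue size callback of this kind might be consumed roughly as follows. This is a hedged sketch with a hypothetical, trimmed-down ops table (`struct cfg_ops_model`, `vq_size_for()` and the `demo_*` functions are not kernel names), and the fallback to a device-wide maximum when the callback is absent is an assumption of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for a vDPA config-ops table. */
struct cfg_ops_model {
    uint16_t (*get_vq_num_max)(void);       /* device-wide upper bound */
    uint16_t (*get_vq_size)(uint16_t idx);  /* optional, per virtqueue */
};

/* Prefer the per-vq callback; fall back to the device-wide maximum. */
uint16_t vq_size_for(const struct cfg_ops_model *ops, uint16_t idx)
{
    assert(ops != NULL && ops->get_vq_num_max != NULL);
    if (ops->get_vq_size)
        return ops->get_vq_size(idx);
    return ops->get_vq_num_max();
}

/* Example backend: queue 2 (a control queue, say) is smaller than the rest. */
uint16_t demo_num_max(void)
{
    return 256;
}

uint16_t demo_vq_size(uint16_t idx)
{
    return idx == 2 ? 64 : 256;
}
```

A backend that leaves `get_vq_size` unset keeps reporting one size for every queue, while a backend that sets it can report a different size per index.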
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-3-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost-vdpa: uapi to support reporting per vq size · 1496c470

Zhu Lingshan authored Feb 03, 2024

The size of a virtqueue is a per-vq configuration.
This commit introduces a new ioctl uAPI to support this flexibility.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240202163905.8834-2-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa/pds: fixes for VF vdpa flr-aer handling · ba6faaa6

Shannon Nelson authored Feb 19, 2024

This addresses a couple of things found while testing the FLR and AER
handling with the VFs.
 - release irqs before calling vp_modern_remove()
 - make sure we have a valid struct pointer before using it to release irqs
 - make sure the FW is alive before trying to add a new device
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Message-Id: <20240220011050.30913-1-shannon.nelson@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vduse: implement DMA sync callbacks · d7b4e328

Maxime Coquelin authored Feb 19, 2024

Since commit 295525e2 ("virtio_net: merge dma
operations when filling mergeable buffers"), VDUSE devices
require support for DMA's .sync_single_for_cpu() operation,
as the memory is non-coherent between the device and CPU
because of the use of a bounce buffer.

This patch implements both the .sync_single_for_cpu() and
.sync_single_for_device() callbacks, and also skips bounce
buffer copies during DMA map and unmap operations if the
DMA_ATTR_SKIP_CPU_SYNC attribute is set, to avoid extra
copies of the same buffer.
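
The role of the two sync callbacks and of the skip attribute can be pictured with a small user-space model of a bounce buffer. Everything here is illustrative: `struct bounce_map`, `sync_for_cpu()`, `sync_for_device()` and `unmap_from_device()` are hypothetical names, and a boolean stands in for the DMA_ATTR_SKIP_CPU_SYNC attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 16

/* "orig" is the driver's buffer, "bounce" is the copy the device works on. */
struct bounce_map {
    unsigned char orig[BOUNCE_SIZE];
    unsigned char bounce[BOUNCE_SIZE];
    int copies;                 /* number of bounce copies performed */
};

/* Copy device-written data back so the CPU sees it. */
void sync_for_cpu(struct bounce_map *m)
{
    assert(m != NULL);
    memcpy(m->orig, m->bounce, BOUNCE_SIZE);
    m->copies++;
}

/* Copy CPU-written data out so the device sees it. */
void sync_for_device(struct bounce_map *m)
{
    assert(m != NULL);
    memcpy(m->bounce, m->orig, BOUNCE_SIZE);
    m->copies++;
}

/* Unmap of a device-to-CPU buffer: skip the copy if the caller already synced. */
void unmap_from_device(struct bounce_map *m, bool skip_cpu_sync)
{
    assert(m != NULL);
    if (!skip_cpu_sync)
        sync_for_cpu(m);
}
```

A caller that has already synced and then unmaps with the skip flag set pays for one copy instead of two, which is the saving the message refers to.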
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Message-Id: <20240219170606.587290-1-maxime.coquelin@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa/mlx5: Allow CVQ size changes · 749a4016

Jonah Palmer authored Feb 16, 2024

The MLX driver was not updating its control virtqueue size at set_vq_num
and instead always initialized to MLX5_CVQ_MAX_ENT (16) at
setup_cvq_vring.

QEMU tries to set the size to 64 by default. However, because the
CVQ size was always initialized to 16, an error was thrown when
sending >16 control messages (as used-ring entry 17 is initialized to 0).
For example, starting a guest with x-svq=on and then executing the
following command would produce the error below:

 # for i in {1..20}; do ifconfig eth0 hw ether XX:xx:XX:xx:XX:XX; done

 qemu-system-x86_64: Insufficient written data (0)
 [  435.331223] virtio_net virtio0: Failed to set mac address by vq command.
 SIOCSIFHWADDR: Invalid argument
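
One way to picture the size mismatch described above is a ring whose two sides disagree about its length. The following is a simplified sketch under that assumption, not the driver's code: `struct cvq_model`, `cvq_setup_old()`, `cvq_setup_new()` and `cvq_slot()` are hypothetical, and only the numbers 16 and 64 come from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CVQ_MAX_ENT 16   /* fixed size used by the old setup path */

struct cvq_model {
    uint16_t num;        /* ring size used by the device side */
};

/* Old behaviour: the requested size is ignored. */
void cvq_setup_old(struct cvq_model *cvq, uint16_t requested)
{
    assert(cvq != NULL);
    (void)requested;
    cvq->num = CVQ_MAX_ENT;
}

/* Fixed behaviour: honour the size that was set for the queue. */
void cvq_setup_new(struct cvq_model *cvq, uint16_t requested)
{
    assert(cvq != NULL && requested != 0);
    cvq->num = requested;
}

/* Used-ring slot the device side writes for the n-th message (0-based). */
uint16_t cvq_slot(const struct cvq_model *cvq, uint32_t n)
{
    assert(cvq != NULL && cvq->num != 0);
    return (uint16_t)(n % cvq->num);
}
```

In this model, a peer that sized the ring at 64 looks for the 17th completion in slot 16, but the old setup wraps it into slot 0, so slot 16 still holds its initial zero.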
Acked-by: Dragos Tatulea <dtatulea@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
Message-Id: <20240216142502.78095-1-jonah.palmer@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Fixes: 5262912e ("vdpa/mlx5: Add support for control VQ and MAC setting")

vdpa: skip suspend/resume ops if not DRIVER_OK · c4e8b5ae

Steve Sistare authored Feb 13, 2024

If a vdpa device is not in state DRIVER_OK, then there is no driver state
to preserve, so no need to call the suspend and resume driver ops.
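
The guard described here amounts to a single status-bit test. A minimal sketch follows, assuming a hypothetical `struct vdpa_model` and `model_suspend()`; the bit value 0x04 is the one the virtio specification assigns to DRIVER_OK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as VIRTIO_CONFIG_S_DRIVER_OK in the virtio specification. */
#define STATUS_DRIVER_OK 0x04

struct vdpa_model {
    uint8_t status;
    int suspend_calls;   /* how often the driver's suspend op ran */
};

/* Returns 0 either way; the driver op only runs when DRIVER_OK is set. */
int model_suspend(struct vdpa_model *d)
{
    assert(d != NULL);
    if (!(d->status & STATUS_DRIVER_OK))
        return 0;        /* no driver state to preserve */
    d->suspend_calls++;  /* stands in for calling the driver's suspend op */
    return 0;
}
```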

Suggested-by: Eugenio Perez Martin <eperezma@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Message-Id: <1707834358-165470-1-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>

virtio: reenable config if freezing device failed · 310227f4

David Hildenbrand authored Feb 13, 2024

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.
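
The error path being fixed can be sketched as follows. This is a user-space model with hypothetical names (`struct vdev_model`, `model_device_freeze()`, `always_fail_freeze()`); it only assumes that config notifications are switched off before the driver's freeze callback runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vdev_model {
    bool config_enabled;                 /* are config-change notifications delivered? */
    int (*freeze)(struct vdev_model *);  /* driver callback, may fail */
};

int model_device_freeze(struct vdev_model *dev)
{
    int ret;

    assert(dev != NULL);
    dev->config_enabled = false;         /* stop notifications while freezing */
    if (dev->freeze) {
        ret = dev->freeze(dev);
        if (ret) {
            dev->config_enabled = true;  /* the fix: undo the disable on failure */
            return ret;
        }
    }
    return 0;
}

/* A driver that, like the virtio-mem case above, always refuses to freeze. */
int always_fail_freeze(struct vdev_model *dev)
{
    (void)dev;
    return -1;
}
```

Without the re-enable in the failure branch, the device in this model would stay deaf to config changes after a failed freeze, matching the resize symptom described above.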

Fixes: 22b7050a ("virtio: defer config changed notifications")
Cc: <stable@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20240213135425.795001-1-david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vdpa_sim: reset must not run · 9588e7fc

Steve Sistare authored Feb 09, 2024

vdpasim_do_reset sets running to true, which is wrong, as it allows
vdpasim_kick_vq to post work requests before the device has been
configured.  To fix, do not set running until VIRTIO_CONFIG_S_DRIVER_OK
is set.
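
The intended state machine can be modelled in a few lines. This is a sketch with hypothetical names (`struct sim_model`, `sim_reset()`, `sim_set_status()`, `sim_kick()`), and the choice to simply ignore a kick while the device is not running is a simplification of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_DRIVER_OK 0x04   /* value of VIRTIO_CONFIG_S_DRIVER_OK */

struct sim_model {
    uint8_t status;
    bool running;
    int work_posted;
};

void sim_reset(struct sim_model *s)
{
    assert(s != NULL);
    s->status = 0;
    s->running = false;      /* the fix: a freshly reset device is not running */
}

void sim_set_status(struct sim_model *s, uint8_t status)
{
    assert(s != NULL);
    s->status = status;
    if (status & STATUS_DRIVER_OK)
        s->running = true;   /* start only once the driver is ready */
}

void sim_kick(struct sim_model *s)
{
    assert(s != NULL);
    if (s->running)
        s->work_posted++;    /* in this model, earlier kicks post no work */
}
```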

Fixes: 0c89e2a3 ("vdpa_sim: Implement suspend vdpa op")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <1707517807-137331-1-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

virtio: uapi: Drop __packed attribute in linux/virtio_pci.h · ec6ecb84

Suzuki K Poulose authored Jan 25, 2024

Commit 92792ac7 ("virtio-pci: Introduce admin command sending function")
added "__packed" structures to UAPI header linux/virtio_pci.h. This triggers
build failures in the consumer userspace applications without proper "definition"
of __packed (e.g., kvmtool build fails).

Moreover, the structures are already packed well and do not need explicit
packing, similar to the rest of the structures in all virtio_* headers. Remove
the __packed attribute.
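
What "already packed well" means can be checked mechanically: a structure whose fields all fall on their natural alignment has no padding, so a packing attribute changes nothing. The sketch below uses an illustrative layout; `struct demo_cmd_hdr` and its field names are assumptions for the example, not a copy of the UAPI header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative header layout; field names and sizes are assumptions. */
struct demo_cmd_hdr {
    uint16_t opcode;
    uint16_t group_type;
    uint8_t  reserved[12];
    uint64_t group_member_id;   /* lands on offset 16 without any padding */
};

/* Size the structure would have if it were explicitly packed. */
size_t demo_packed_size(void)
{
    return 2 * sizeof(uint16_t) + 12 + sizeof(uint64_t);
}

/* Returns 1 when the compiler inserted no padding at all. */
int demo_has_no_padding(void)
{
    return sizeof(struct demo_cmd_hdr) == demo_packed_size() &&
           offsetof(struct demo_cmd_hdr, group_member_id) == 16;
}

/* Runtime sanity check a consumer could call once at start-up. */
void demo_check_layout(void)
{
    assert(demo_has_no_padding());
}
```

Because the check relies only on `sizeof` and `offsetof`, userspace consumers need no definition of `__packed` to verify the layout.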

Fixes: 92792ac7 ("virtio-pci: Introduce admin command sending function")
Cc: Feng Liu <feliu@nvidia.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Yishai Hadas <yishaih@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Message-Id: <20240125232039.913606-1-suzuki.poulose@arm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost: Added pad cleanup if vnet_hdr is not present. · f6baca2d

Andrew Melnychenko authored Jan 15, 2024

When QEMU is launched with vhost but without tap vnet_hdr,
vhost tries to copy vnet_hdr from the socket iter with size 0
to a page that may contain stale data.
That stale data can be interpreted as unpredictable values for
vnet_hdr.
This leads to some packets being dropped and, in some cases, to
the vhost routine stalling when vhost_net tries to process
packets and fails in a loop.

QEMU options:
  -netdev tap,vhost=on,vnet_hdr=off,...
Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
Message-Id: <20240115194840.1183077-1-andrew@daynix.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

10 Mar, 2024 7 commits

Linux 6.8 · e8f897f4
Linus Torvalds authored Mar 10, 2024

Merge tag 'trace-ring-buffer-v6.8-rc7' of... · fa4b851b

Linus Torvalds authored Mar 10, 2024

Merge tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Do not allow large strings (> 4096) as single write to trace_marker

The size of a string written into trace_marker was determined by the
size of the sub-buffer in the ring buffer. That size is dependent on
the PAGE_SIZE of the architecture as it can be mapped into user
space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit of
the string of writing into trace_marker 64K.

One of the selftests looks at the size of the ring buffer sub-buffers
and writes that plus more into the trace_marker. The write will take
what it can and report back what it consumed so that the user space
application (like echo) will write the rest of the string. The string
is stored in the ring buffer and can be read via the "trace" or
"trace_pipe" files.

The reading of the ring buffer uses vsnprintf(), which uses a
precision "%.*s" to make sure it only reads what is stored in the
buffer, as a bug could cause the string to be non terminated.

With the combination of the precision change and the PAGE_SIZE of 64K
allowing huge strings to be added into the ring buffer, plus the test
that would actually stress that limit, a bug was reported that the
precision used was too big for "%.*s" as the string was close to 64K
in size and the max precision of vsnprintf is 32K.

Linus suggested not to have that precision as it could hide a bug if
the string was again stored without a nul byte.

Another issue that was brought up is that the trace_seq buffer is
also based on PAGE_SIZE even though it is not tied to the
architecture limit like the ring buffer sub-buffer is. Having it be
64K * 2 is simply too big and wastes memory on systems with 64K
page sizes. It is now hardcoded to 8K which is what all other
architectures with 4K PAGE_SIZE have.

Finally, the write to trace_marker is now limited to 4K as there is
no reason to write larger strings into trace_marker.

- ring_buffer_wait() should not loop.

The ring_buffer_wait() does not have the full context (yet) to know
whether it should loop or not. Just exit the loop as soon as it's woken
up and let the callers decide to loop or not (they already do, so it's
a bit redundant).

- Fix shortest_full field to be the smallest amount in the ring buffer
that a waiter is waiting for. The "shortest_full" field is updated
when a new waiter comes in and wants to wait for a smaller amount of
data in the ring buffer than other waiters. But after all waiters are
woken up, it's not reset, so if another waiter comes in wanting to
wait for more data, it will be woken up when the ring buffer has a
smaller amount from what the previous waiters were waiting for.

- The wake up of all waiters on close is incorrectly called from
.release() and not from .flush() so it will never wake up any waiters
as the .release() will not get called until all .read() calls are
finished. And the wakeup is for the waiters in those .read() calls.

* tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Use .flush() call to wake up readers
ring-buffer: Fix resetting of shortest_full
ring-buffer: Fix waking up ring buffer readers
tracing: Limit trace_marker writes to just 4K
tracing: Limit trace_seq size to just 8K and not depend on architecture PAGE_SIZE
tracing: Remove precision vsnprintf() check from print event

Merge tag 'phy-fixes3-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy · 210ee636

Linus Torvalds authored Mar 10, 2024

Pull phy fixes from Vinod Koul:

 - fixes for the Qualcomm qmp-combo driver for the ordering of drm and
   type-c switch registration, since drivers must not probe defer after
   having registered child devices, to avoid triggering a probe deferral loop.

   This fixes internal display on Lenovo ThinkPad X13s

* tag 'phy-fixes3-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
  phy: qcom-qmp-combo: fix type-c switch registration
  phy: qcom-qmp-combo: fix drm bridge registration

tracing: Use .flush() call to wake up readers · e5d7c191

Steven Rostedt (Google) authored Mar 08, 2024

The .release() function does not get called until all readers of a file
descriptor are finished.

If a thread is blocked on reading a file descriptor in ring_buffer_wait(),
and another thread closes the file descriptor, it will not wake up the
other thread as ring_buffer_wake_waiters() is called by .release(), and
that will not get called until the .read() is finished.

The issue originally showed up in trace-cmd, but the readers are actually
other processes with their own file descriptors. So calling close() would wake
up the other tasks because they are blocked on another descriptor than the
one that was closed. But there are other wakeups that solve that issue.

When a thread is blocked on a read, it can still hang even when another
thread closed its descriptor.

This is what the .flush() callback is for. Have the .flush() wake up the
readers.
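
The difference between the two callbacks can be modelled with a reference count. This is a user-space sketch, not VFS code: `struct file_model`, `model_wake_waiters()` and `model_close()` are hypothetical, and it assumes a blocked read holds a reference on the file so that the release step cannot run while the read is in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct file_model {
    int refs;             /* open handle plus in-flight read() calls */
    bool waiters_woken;
    bool wake_in_flush;   /* false: old behaviour, true: fixed behaviour */
};

void model_wake_waiters(struct file_model *f)
{
    assert(f != NULL);
    f->waiters_woken = true;
}

/* close(): flush runs on every close, release only when the last reference drops. */
void model_close(struct file_model *f)
{
    assert(f != NULL && f->refs > 0);
    if (f->wake_in_flush)
        model_wake_waiters(f);   /* stands in for .flush() */
    if (--f->refs == 0)
        model_wake_waiters(f);   /* stands in for .release() */
}
```

With one open handle and one blocked reader (`refs == 2`), closing the handle wakes nobody in the old arrangement, because the reader's own reference keeps the release step from running.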

Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: f3ddb74a ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

ring-buffer: Fix resetting of shortest_full · 68282dd9

Steven Rostedt (Google) authored Mar 08, 2024

The "shortest_full" variable is used to keep track of the waiter that is
waiting for the smallest amount on the ring buffer before being woken up.
When a task waits on the ring buffer, it passes in a "full" value that is
a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to
100% full buffer.

As all waiters are on the same wait queue, the wake up happens for the
waiter with the smallest percentage.

The problem is that the shortest_full on the cpu_buffer that stores the
smallest amount doesn't get reset when all the waiters are woken up. It
does get reset when the ring buffer is reset (echo > /sys/kernel/tracing/trace).

This means that tasks may be woken up more often than they want to
be. Instead, have the shortest_full field get reset just before waking up
all the tasks. If the tasks wait again, they will update the shortest_full
before sleeping.
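
The bookkeeping can be illustrated with a small model. The names below (`struct cpu_buffer_model`, `model_wait_prepare()`, `model_should_wake()`, `model_wake_all()`) are hypothetical, and treating 0 as "no watermark recorded" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_buffer_model {
    int shortest_full;   /* smallest non-zero percentage waited for; 0 = none */
};

/* A waiter records the fill percentage it wants before it sleeps. */
void model_wait_prepare(struct cpu_buffer_model *b, int full)
{
    assert(b != NULL && full >= 0 && full <= 100);
    if (full && (!b->shortest_full || full < b->shortest_full))
        b->shortest_full = full;
}

/* Returns 1 when the writer would wake the queue at this fill level. */
int model_should_wake(const struct cpu_buffer_model *b, int percent_filled)
{
    assert(b != NULL);
    return percent_filled >= b->shortest_full;
}

/* The fix: forget the old watermark just before waking everybody. */
void model_wake_all(struct cpu_buffer_model *b)
{
    assert(b != NULL);
    b->shortest_full = 0;
}
```

If the reset in `model_wake_all()` were left out, a later waiter asking for 80% would inherit a stale 10% watermark and be woken far too early.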

Also add locking around setting of shortest_full in the poll logic, and
change "work" to "rbwork" to match the variable name for rb_irq_work
structures that are used in other places.

Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: 2c2b0a78 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 137e0ec0

Linus Torvalds authored Mar 10, 2024

Pull kvm fixes from Paolo Bonzini:
 "KVM GUEST_MEMFD fixes for 6.8:

   - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
     to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not
     writable from userspace, so there would be no way to write to a
     read-only guest_memfd).

   - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
     clear that such VMs are purely for development and testing.

   - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term
     plan is to support confidential VMs with deterministic private
     memory (SNP and TDX) only in the TDP MMU.

   - Fix a bug in a GUEST_MEMFD dirty logging test that caused false
     passes.

  x86 fixes:

   - Fix missing marking of a guest page as dirty when emulating an
     atomic access.

   - Check for mmu_notifier invalidation events before faulting in the
     pfn, and before acquiring mmu_lock, to avoid unnecessary work and
     lock contention with preemptible kernels (including
     CONFIG_PREEMPT_DYNAMIC in non-preemptible mode).

   - Disable AMD DebugSwap by default, it breaks VMSA signing and will
     be re-enabled with a better VM creation API in 6.10.

   - Do the cache flush of converted pages in svm_register_enc_region()
     before dropping kvm->lock, to avoid a race with unregistering of
     the same region and the consequent use-after-free issue"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  SEV: disable SEV-ES DebugSwap by default
  KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
  KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
  KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive
  KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases
  KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
  KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
  KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
  KVM: x86: Mark target gfn of emulated atomic instruction as dirty

ring-buffer: Fix waking up ring buffer readers · b3594573

Steven Rostedt (Google) authored Mar 08, 2024

A task can wait on a ring buffer for when it fills up to a specific
watermark. The writer will check the minimum watermark that waiters are
waiting for and if the ring buffer is past that, it will wake up all the
waiters.

The waiters are in a wait loop, and will first check if a signal is
pending and then check if the ring buffer is at the desired level where it
should break out of the loop.

If a file that uses a ring buffer closes, and there are threads waiting on
the ring buffer, it needs to wake up those threads. To do this, a
"wait_index" was used.

Before entering the wait loop, the waiter will read the wait_index. On
wakeup, it will check if the wait_index is different than when it entered
the loop, and will exit the loop if it is. The waker will only need to
update the wait_index before waking up the waiters.

This had a couple of bugs. One trivial one and one broken by design.

The trivial bug was that the waiter checked the wait_index after the
schedule() call. It had to be checked between the prepare_to_wait() and
the schedule() which it was not.

The main bug is that the first check to set the default wait_index will
always be outside the prepare_to_wait() and the schedule(). That's because
the ring_buffer_wait() doesn't have enough context to know if it should
break out of the loop.

The loop itself is not needed, because all the callers of
ring_buffer_wait() also have their own loop, as the callers have a better
sense of what the context is to decide whether to break out of the loop
or not.

Just have the ring_buffer_wait() block once, and if it gets woken up, exit
the function and let the callers decide what to do next.

Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aa ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

09 Mar, 2024 2 commits

Merge tag 'i2c-for-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 005f6f34

Linus Torvalds authored Mar 09, 2024

Pull i2c fixes from Wolfram Sang:
 "Two patches from Heiner for the i801 are targeting muxes discovered
  while working on some other features. Essentially, there is a
  reordering when adding optional slaves and proper cleanup upon
  registering a mux device.

  Christophe fixes the exit path in the wmt driver that was leaving the
  clocks hanging, and the last fix from Tommy avoids false error reports
  in IRQ"

* tag 'i2c-for-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: aspeed: Fix the dummy irq expected print
  i2c: wmt: Fix an error handling path in wmt_i2c_probe()
  i2c: i801: Avoid potential double call to gpiod_remove_lookup_table
  i2c: i801: Fix using mux_pdev before it's set

Merge tag 'firewire-fixes-6.8-final' of... · 66695e7d

Linus Torvalds authored Mar 09, 2024

Merge tag 'firewire-fixes-6.8-final' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire fix from Takashi Sakamoto:
 "A fix to suppress a warning about unreleased IRQ for 1394 OHCI
  hardware when disabling MSI.

  In Linux kernel v6.5, a PCI driver for 1394 OHCI hardware was
  optimized into the managed device resources. Edmund Raile points out
  that the change brings the warning about unreleased IRQ at the call of
  pci_disable_msi(), since the API expects that the relevant IRQ has
  already been released in advance.

  As long as the API is called in .remove callback of PCI device
  operation, it is prohibited to maintain the IRQ as the part of managed
  device resource. As a workaround, the IRQ is explicitly released at
  .remove callback, before the call of pci_disable_msi().

  pci_disable_msi() is a legacy API nowadays in the PCI MSI implementation.
  I plan to replace it with the modern API during development of a
  future version of the Linux kernel, so for now it is kept as is"

* tag 'firewire-fixes-6.8-final' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: ohci: prevent leak of left-over IRQ on unbind
