Commits · cc5b4c4c75c43184c1d8684299efef2a68b87720 · Kirill Smelkov / linux

01 Sep, 2021 4 commits

habanalabs: clear msg_to_cpu_reg to avoid misread after reset · cc5b4c4c

Koby Elbaz authored Jul 19, 2021

For some ASICs, the f/w reads the msg_to_cpu_reg value after
reset, and for some it doesn't.
Therefore, to be sure f/w doesn't read a wrong value after reset, we
need to clear this register before the reset occurs.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: make set_pci_regions asic function · b9317d51

Ohad Sharabi authored Jul 15, 2021

In order to better support variants of the same ASIC, the
set_pci_regions function is now an ASIC function. This allows each
ASIC to implement it internally, thus keeping all definitions static
to the file.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: convert PCI BAR offset to u64 · 932adf16

Ohad Sharabi authored Jul 18, 2021

Done because the BAR size can exceed 4GB.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: expose server type in INFO IOCTL · 5dc9ffaf

Oded Gabbay authored Jul 15, 2021

Add the server type property to the hl_info_hw_ip_info structure
that is exposed to the user via the INFO IOCTL.

This is needed by the userspace s/w stack to know the connections map
of the internal links that connect the ASICs among themselves inside
the server.

The F/W will tell us, as part of the NIC information, the server type
that the GAUDI is located in.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

29 Aug, 2021 32 commits

habanalabs: remove redundant warning message · e62ada5e

Oded Gabbay authored Jul 14, 2021

This warning is redundant as we will print a notice in case the device
is still in use after the FD was closed. No need to print the same
message per context.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add support for encapsulated signals submission · e4cdccd2

farah kassabri authored May 26, 2021

This commit is the second part of the encapsulated signals feature.
It contains the driver support for submitting CSs with encapsulated
signals and for waiting on them.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add support for encapsulated signals reservation · dadf17ab

farah kassabri authored May 24, 2021

The capability to signal from within an encapsulated op is merged into
the existing stream architecture, so that an encapsulated op can
trigger multiple signals, each at the point in the graph execution
where the corresponding event is done. This avoids the need to wait
for the whole encapsulated op to complete before the stream can
signal.

This commit implements only the reserve/unreserve part.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: signal/wait change sync object reset flow · 8ca2072e

farah kassabri authored Jun 20, 2021

Currently the SOB reset is done in the fence release function, which
happens only at the CS wraparound during the CS allocation time.

In order to support the new encapsulated signals reservation feature,
we need to move the SOB reset to an earlier phase because this SOB
could reach its max value very fast using the signal reservation.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add wait-for-multi-CS uAPI · 215f0c17

Ohad Sharabi authored Jun 14, 2021

When the user sends multiple CSs, waiting for each CS separately is not
efficient as it involves many user-kernel context switches.

In order to address this issue we add support to "wait on multiple CSs"
using a new uAPI which can wait on a maximum of 32 CSs. The new uAPI is
defined using a new flag - WAIT_FOR_MULTI_CS - in the wait_for_cs IOCTL.

The input parameters for this uAPI will be:
@seq: user pointer to an array of up to 32 CS sequence numbers.
@seq_array_len: length of the sequence array.
@timeout_us: timeout for waiting for any CS.

The output parameters for this uAPI will be:
@status: multi-CS ioctl completion status (a dedicated status was
         added as well).
@flags: bitmap of output flags of the CS.
@cs_completion_map: bitmap for multi-CS; if the CS sequence that was
                    placed at index N in the input seq array has
                    completed, the N-th bit in cs_completion_map will
                    be 1, otherwise it will be 0.
@timestamp_nsec: timestamp of the first completed CS.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
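
The bit-per-index rule for cs_completion_map described above can be
sketched as follows. This is only an illustration of how userspace
might decode the bitmap, not the driver's or the uAPI header's code:
the helper name completed_cs(), the MULTI_CS_MAX macro and the 32-bit
width chosen for the map are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define MULTI_CS_MAX 32 /* limit stated in the commit message */

/*
 * Hypothetical helper: copy into 'out' every sequence number whose
 * index has its bit set in the completion map. Returns how many
 * sequence numbers were copied. 'out' must have room for seq_len items.
 */
static size_t completed_cs(const uint64_t *seq, size_t seq_len,
                           uint32_t completion_map, uint64_t *out)
{
    size_t n = 0;

    /* Indices beyond the 32-entry limit cannot be represented. */
    if (seq_len > MULTI_CS_MAX)
        seq_len = MULTI_CS_MAX;

    for (size_t i = 0; i < seq_len; i++) {
        /* Bit i set means the CS placed at index i has completed. */
        if (completion_map & (UINT32_C(1) << i))
            out[n++] = seq[i];
    }
    return n;
}
```

With the input array {100, 101, 102, 103} and a map of 0x5 (bits 0 and
2 set), the helper reports the sequences at indices 0 and 2 as
completed.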

habanalabs: get multiple fences under same cs_lock · c457d5ab

Ohad Sharabi authored Jun 20, 2021

To add proper support for wait-for-multi-CS, locking the CS lock
for each CS fence in the list is not efficient.

Instead, this patch adds support for taking the CS lock once to get
all required fences.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: revise prints on FD close · a6cd2551

Oded Gabbay authored Jul 13, 2021

The driver quietly handles memory mappings that were not freed so no
need to print a warning about that when user closes the FD.

Accordingly, revise the text that is printed in case the device is
still in use after the user process closed the FD.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/goya: add missing initialization · 7886acb6

Oded Gabbay authored Jul 13, 2021

Initialize the f/w Linux-loaded indication to false to prevent
wrong communication with the f/w.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: update firmware header to latest version · 2a2c4b74

Oded Gabbay authored Jul 13, 2021

Add two new fields regarding interrupts communication between driver
and f/w.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: fix race between soft reset and heartbeat · 8bb8b505

Koby Elbaz authored Jul 06, 2021

There is a scenario where an ongoing soft reset would race with an
ongoing heartbeat routine, eventually causing heartbeat to fail and
thus to escalate into a hard reset.

With this fix, soft-reset procedure will disable heartbeat CPU messages
and flush the (ongoing) current one before continuing with reset code.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: fix information printed on SM event · ae2021d3

Oded Gabbay authored Jul 12, 2021

Print the SM name instead of the index because it is more informative
for the user to know the SM name than the id when an SM interrupt
occurs.

In addition, the index that is printed is of the SOB group, not
a specific SOB.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: trigger state dump in case of SM errors · 7148e647

Ofir Bitton authored Jul 12, 2021

State dump is relevant to the user in case of Sync Manager error, so
we need to trigger it in that case as well.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: set dma max segment size · a6946151

Oded Gabbay authored Jul 11, 2021

This is required of any device that is capable of performing DMA.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add asic property of host dma offset · 2b5bbef5

Oded Gabbay authored Jul 10, 2021

Each ASIC can have a different offset to add to a host dma address,
to enable the ASIC to access that host memory.

The usage for this can be common code so add this to the asic
property structure.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: fix type of variable · d18bf13e

Oded Gabbay authored Jul 10, 2021

Recently, the size parameter in the userptr structure was changed to
u64. As a result, we need to change the type of the local range_size
in device_va_to_pa() to u64 to avoid overflow.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
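
The overflow mentioned above is the usual truncation that happens when
a 64-bit size is stored in a 32-bit variable. The sketch below is a
standalone illustration, not the driver's device_va_to_pa() code; both
function names are hypothetical stand-ins for the local variable before
and after the type change.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit local: only the low 32 bits of
 * the 64-bit size survive the assignment. */
static uint32_t range_size_u32(uint64_t userptr_size)
{
    return (uint32_t)userptr_size;
}

/* Hypothetical stand-in for the widened local: the value is kept. */
static uint64_t range_size_u64(uint64_t userptr_size)
{
    return userptr_size;
}
```

For a 5 GiB mapping (5ULL << 30), the 32-bit variant holds 1 GiB, since
5 GiB modulo 4 GiB is 1 GiB, while the 64-bit variant holds the full
size.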

habanalabs: mark linux image as not loaded after hw_fini · a9623a8b

Tomer Tayar authored Jul 09, 2021

If hard reset fails after the call to hw_fini and before loading the
linux image to the device, a subsequent call to hw_fini should
communicate via COMMS (or MSG_TO_CPU regs for old FW versions).
However, the driver still tries in this case to communicate via the GIC,
and thus no hard reset is actually done.
To avoid that, the patch clears the linux_loaded flag after every call
to hw_fini.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: fix nullifying of destroyed mmu pgt pool · 89aad770

Tomer Tayar authored Jul 09, 2021

In case of host-resident MMU, when the page tables pool is destroyed,
its pointer is not nullified correctly.
As a result, on a device fini which happens after a failing reset, the
already destroyed pool is accessed, which leads to a kernel panic.
The patch fixes the setting of the pool pointer to NULL.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: rename cb_mmap to mmap · 1ee8e2ba

Zvika Yehudai authored Jul 06, 2021

This function will be used for more mmap operations than just
mmaping CBs.
Signed-off-by: Zvika Yehudai <zyehudai@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: missing mutex_unlock in process kill procedure · 40e35d19

Ofir Bitton authored Jul 06, 2021

Add the missing mutex unlock on the path where the driver gives up on
killing user processes.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: implement state dump · 77977ac8

Yuri Nudelman authored Jun 06, 2021

At the first stage, only gaudi core dump shall be implemented, not
including the status registers.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: state dump monitors and fences infrastructure · fd2010b5

Yuri Nudelman authored Jun 09, 2021

With the infrastructure in place, monitors and fences dump shall be
implemented.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: expose state dump · 938b793f

Yuri Nudelman authored Jun 06, 2021

To improve the user's ability to debug the case where a workload that
is part of executing training/inference of a topology is getting stuck,
we need to add a 'core dump' each time a CS times-out. The 'core dump'
shall contain all relevant Sync Manager information and corresponding
fence values.

The most recent dumps shall be accessible via debugfs, under
'state_dump' node. Reading from the node will provide the oldest dump
available. Writing an integer value X will discard X dumps, starting
with the oldest one, i.e. subsequent read will now return newer
dumps.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
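
The read/discard semantics of the 'state_dump' node described above
behave like a queue that is consumed from its oldest end. The following
userspace model is a sketch under stated assumptions and is not the
driver's debugfs code: the type and function names, the fixed capacity
DUMP_SLOTS, and the choice to drop the oldest dump when the queue is
full are all hypothetical.

```c
#include <stddef.h>

#define DUMP_SLOTS 4 /* hypothetical capacity */

struct dump_ring {
    const char *slot[DUMP_SLOTS];
    size_t head;  /* index of the oldest stored dump */
    size_t count; /* number of stored dumps */
};

/* Store a new dump; when full, the oldest one is dropped (assumption). */
static void dump_push(struct dump_ring *r, const char *dump)
{
    if (r->count == DUMP_SLOTS) {
        r->head = (r->head + 1) % DUMP_SLOTS;
        r->count--;
    }
    r->slot[(r->head + r->count) % DUMP_SLOTS] = dump;
    r->count++;
}

/* Model of reading the node: return the oldest dump, or NULL if none. */
static const char *dump_read(const struct dump_ring *r)
{
    return r->count ? r->slot[r->head] : NULL;
}

/* Model of writing X to the node: discard X dumps, oldest first. */
static void dump_discard(struct dump_ring *r, size_t x)
{
    if (x > r->count)
        x = r->count;
    r->head = (r->head + x) % DUMP_SLOTS;
    r->count -= x;
}
```

After pushing three dumps, a read returns the first one; discarding two
makes the next read return the third, which matches the "subsequent
read will now return newer dumps" behavior in the message.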

habanalabs: use get_task_pid() to take PID · e79e745b

Oded Gabbay authored Jul 03, 2021

The previous function we used, find_get_pid(), didn't work in case
the user process was run inside docker.

As a result, we didn't have the PID and we couldn't kill the user
process in case the device got stuck and we needed to reset the
device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: allow disabling huge page use · fbcd0efe

Oded Gabbay authored Jun 29, 2021

Sometimes we may need to disable optimization of using huge pages
in our memory management code. Add such a flag to the function that
creates the list of physical pages that would be programmed into the
device MMU.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: user mappings can be 64-bit · 00ce0653

Oded Gabbay authored Jun 29, 2021

Increase the size variable in the userptr structure to 64-bit. That
variable describes the size of the memory allocation of the user that
is now being mapped into the device. The mapping can be larger than
4GB, so we need to support it.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: handle case of interruptable wait · 429d77ca

Oded Gabbay authored Jul 01, 2021

As in the regular wait for CS, we need to handle the case where the
wait for a user interrupt was itself interrupted. In that case, we
need to return the correct error code to the user.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: release pending user interrupts on device fini · b07e6c7e

Oded Gabbay authored Jul 01, 2021

Device fini was missing a call to release all pending user
interrupts. That can cause a process to be stuck inside the driver's
IOCTL of wait for interrupts, in case the device is removed or the
simulator is killed at the same time.

In addition, the call to remove the inactive codec job was also
missing.

Moreover, to prevent such errors in the future (where code is added
to reset path but not to device fini), we moved some common parts
to two dedicated functions:
cleanup_resources
take_release_locks
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: re-init completion object upon retry · d5546d78

Oded Gabbay authored Jul 01, 2021

In case user interrupt arrived but the completion value is less than
the target value, we want to retry the wait.

However, before the retry we must reinitialize the completion object,
under spin-lock, so the wait function won't exit immediately because
the completion object is already completed (from the previous
interrupt).
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: rename enum vm_type_t to vm_type · 82629c71

Oded Gabbay authored Jun 29, 2021

We don't use typedefs so the enum name shouldn't end with _t
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: update firmware header files · c67b0579

Ofir Bitton authored Jun 28, 2021

Update recent changes made in firmware header files, which contain
a minor COMMS protocol change and new error status definitions.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: allow fail on inability to respect hint · 486e1979

Yuri Nudelman authored Jun 03, 2021

A new user flag is required to make memory map hint mandatory, in
contrast to the current situation where it is best effort.
This is due to the requirement to map certain data to specific
pre-determined device virtual address ranges.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: support hint addresses range reservation · 1ae32b90

farah kassabri authored Jan 31, 2021

Add support for pre-determined driver-reserved device VA address ranges.
This is needed for future ASIC support where some contents must be
mapped into these pre-determined ranges because the H/W will be
configured using these ranges.

In case the user asks to map a VA without a hint address, avoid
allocating the device VA from the reserved ranges.

Make sure the validation checks of the hint address take into account
the situation where the DRAM page size is not a power of 2.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
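
The remark about page sizes that are not a power of 2 matters because
the common mask test, addr & (page_size - 1), is only valid for
power-of-2 sizes. A minimal sketch of an alignment check that handles
both cases is shown below; it is an illustration only, not the driver's
validation code, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when x has exactly one bit set. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical check: is the hint address a multiple of the page size? */
static bool hint_is_aligned(uint64_t addr, uint64_t page_size)
{
    if (page_size == 0)
        return false;

    /* The mask shortcut is only correct for power-of-2 sizes. */
    if (is_pow2(page_size))
        return (addr & (page_size - 1)) == 0;

    /* General case: fall back to a modulo. */
    return addr % page_size == 0;
}
```

For example, with a 3 MiB page size, the address 4 MiB passes the mask
test (0x400000 & 0x2FFFFF is 0) even though it is not a multiple of
3 MiB; the modulo branch rejects it.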

27 Aug, 2021 4 commits

Revert "bus: mhi: Add inbound buffers allocation flag" · 0dc3ad3f

Greg Kroah-Hartman authored Aug 27, 2021

This reverts commit 0092a1e3

This should be reverted in the char-misc-next branch to make merging
with Linus's branch possible, due to issues with the mhi code that
were found in the networking tree.

Link: https://lore.kernel.org/r/20210827175852.GB15018@thinkpad
Reported-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bhaumik Bhatt <bbhatt@codeaurora.org>
Cc: Hemant Kumar <hemantk@codeaurora.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Loic Poulain <loic.poulain@linaro.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

misc/pvpanic: fix set driver data · a99009bc

Mihai Carabas authored Aug 19, 2021

Add dev_set_drvdata() back, this time in devm_pvpanic_probe(), so that
dev_get_drvdata() does not return NULL.

Fixes: 394febc9 ("misc/pvpanic: Make 'pvpanic_probe()' resource managed")
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Link: https://lore.kernel.org/r/1629385946-4584-2-git-send-email-mihai.carabas@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

VMCI: fix NULL pointer dereference when unmapping queue pair · a30dc6cf

Wang Hai authored Aug 18, 2021

I got a NULL pointer dereference report when doing fuzz test:

Call Trace:
  qp_release_pages+0xae/0x130
  qp_host_unregister_user_memory.isra.25+0x2d/0x80
  vmci_qp_broker_unmap+0x191/0x320
  ? vmci_host_do_alloc_queuepair.isra.9+0x1c0/0x1c0
  vmci_host_unlocked_ioctl+0x59f/0xd50
  ? do_vfs_ioctl+0x14b/0xa10
  ? tomoyo_file_ioctl+0x28/0x30
  ? vmci_host_do_alloc_queuepair.isra.9+0x1c0/0x1c0
  __x64_sys_ioctl+0xea/0x120
  do_syscall_64+0x34/0xb0
  entry_SYSCALL_64_after_hwframe+0x44/0xae

When a queue pair is created by the following call, it will not
register the user memory if the page_store is NULL, and the
entry->state will be set to VMCIQPB_CREATED_NO_MEM.

vmci_host_unlocked_ioctl
  vmci_host_do_alloc_queuepair
    vmci_qp_broker_alloc
      qp_broker_alloc
        qp_broker_create // set entry->state = VMCIQPB_CREATED_NO_MEM;

When unmapping this queue pair, qp_host_unregister_user_memory() will
be called to unregister the non-existent user memory, which will
result in a NULL pointer dereference. It will also change
VMCIQPB_CREATED_NO_MEM to VMCIQPB_CREATED_MEM, a transition that should
not happen in this operation.

The qp broker can unregister the user memory when unmapping only if
it has memory.

The qp broker can register the user memory when mapping only if it
has no memory.

Fixes: 06164d2b ("VMCI: queue pairs implementation.")
Cc: stable <stable@vger.kernel.org>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20210818124845.488312-1-wanghai38@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
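
The two rules above (unregister only when memory is present, register
only when it is absent) amount to a guard on the entry state. The model
below is a simplified sketch and not the VMCI driver code: it keeps only
the two states named in the message, and the struct, the helper names
and the page counter are hypothetical.

```c
#include <stdbool.h>

/* Only the two states named in the commit message are modeled. */
enum qpb_state { VMCIQPB_CREATED_NO_MEM, VMCIQPB_CREATED_MEM };

struct qp_entry {
    enum qpb_state state;
    int registered_pages; /* hypothetical stand-in for user memory */
};

static bool qp_has_mem(const struct qp_entry *e)
{
    return e->state == VMCIQPB_CREATED_MEM;
}

/* Unmap: touch user memory only if some was registered. */
static bool qp_unmap(struct qp_entry *e)
{
    if (!qp_has_mem(e))
        return false; /* nothing to unregister, state left unchanged */
    e->registered_pages = 0;
    e->state = VMCIQPB_CREATED_NO_MEM;
    return true;
}

/* Map: register user memory only if none is registered yet. */
static bool qp_map(struct qp_entry *e, int pages)
{
    if (qp_has_mem(e))
        return false;
    e->registered_pages = pages;
    e->state = VMCIQPB_CREATED_MEM;
    return true;
}
```

In this model, unmapping an entry that is still in
VMCIQPB_CREATED_NO_MEM is a no-op instead of an access to memory that
was never registered.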

char: mware: fix returnvar.cocci warnings · f8cefead

jing yangyang authored Aug 19, 2021

Remove unneeded variables when "0" can be returned.

Generated by: scripts/coccinelle/misc/returnvar.cocci
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: jing yangyang <jing.yangyang@zte.com.cn>
Link: https://lore.kernel.org/r/20210820021752.10927-1-jing.yangyang@zte.com.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
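
The kind of pattern this cleanup targets, a local variable that is
only ever initialized to 0 and then returned, can be illustrated as
follows. The two functions are hypothetical and are not taken from the
patched driver.

```c
/* Before: 'ret' is never modified, so the variable is unneeded. */
static int example_before(void)
{
    int ret = 0;

    /* ... work that never assigns to ret ... */
    return ret;
}

/* After: return the constant directly. */
static int example_after(void)
{
    return 0;
}
```

Both versions behave identically; the second simply drops the
redundant local.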
