Commits · 9c604af0c9d4efe4f308761229186768b3f3a6a9 · Kirill Smelkov / linux

23 Nov, 2022 40 commits

habanalabs/gaudi2: return to reset upon SM SEI BRESP error · 9c604af0

Tomer Tayar authored Oct 20, 2022

Due to a H/W issue in the LBW path to the PCIE_DBI MSI-X doorbell, SM
sporadically received false error responses when it was configured to
write there, and hence no reset was done as part of handling the
relevant event.
Now that the virtual MSI-X doorbell is used, such errors in SM are not
expected and the reset shouldn't be skipped.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: don't enable entries in the MSIX_GW table · 2c77ec14

Tomer Tayar authored Oct 20, 2022

User should use the virtual MSI-X doorbell to generate interrupts from
the device, so there is no need to enable entries in the MSIX_GW table.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: remove redundant firmware version check · 24c983c8

farah kassabri authored Nov 08, 2022

Firmware 1.7 is the first official firmware, so no need to check
if we are running a version below it.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: fix print for firmware-alive event · fe3e88c9

Tomer Tayar authored Nov 07, 2022

Add missing le{32,64}_to_cpu conversions.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
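
A userspace sketch of what such conversions do, assuming a firmware field arrives as little-endian bytes. `demo_le32_to_cpu()` and `demo_le64_to_cpu()` are hypothetical stand-ins written for illustration; they are not the kernel's `le32_to_cpu()`/`le64_to_cpu()` helpers and not the code of this commit:

```c
#include <assert.h>
#include <stdint.h>

/* Decode a little-endian 32-bit value byte by byte, so the result is
 * the same on little-endian and big-endian hosts. */
static uint32_t demo_le32_to_cpu(const uint8_t *p)
{
	assert(p);
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}

/* Same idea for 64 bits: the low word comes first in memory. */
static uint64_t demo_le64_to_cpu(const uint8_t *p)
{
	return (uint64_t)demo_le32_to_cpu(p) |
	       ((uint64_t)demo_le32_to_cpu(p + 4) << 32);
}
```

Printing the raw field without such a conversion would show a byte-swapped value on a big-endian host, which is the kind of wrong print this commit addresses.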

habanalabs: fix print for out-of-sync and pkt-failure events · 5f8981d6

Tomer Tayar authored Nov 07, 2022

Add missing le32_to_cpu() conversions, and use %d for the value
returned from atomic_read().
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: add page fault notify event · d3027f4a

Dani Liberman authored Oct 31, 2022

Each time a page fault happens, besides capturing its data, also notify
the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: classify power/thermal events as info · a63de89b

Ofir Bitton authored Nov 06, 2022

As power and thermal envelope events are purely informative and do not
indicate an error, we reduce the print level to info only.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: skip events info ioctl if not supported · b829e010

Ohad Sharabi authored Nov 06, 2022

Some ASICs haven't yet implemented this functionality and so the
ioctl call should fail and the user should be notified of the reason.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
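
A minimal sketch of the "fail if not supported" pattern, assuming the per-ASIC functionality is reached through an optional callback. The structure, the function names and the choice of `-EOPNOTSUPP` as the error code are all assumptions made for illustration, not taken from the driver:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-ASIC function table; the callback is NULL on ASICs
 * that have not implemented the events-info functionality. */
struct demo_asic_funcs {
	int (*get_events_info)(unsigned int *count);
};

/* Example callback for an ASIC that does support the query. */
static int demo_get_events_info(unsigned int *count)
{
	*count = 2;
	return 0;
}

/* Fail early with an explicit error code when the ASIC lacks support,
 * so the caller learns the reason instead of getting empty data. */
static int demo_events_info_ioctl(const struct demo_asic_funcs *funcs,
				  unsigned int *count)
{
	if (!funcs->get_events_info)
		return -EOPNOTSUPP;

	return funcs->get_events_info(count);
}
```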

habanalabs: fix firmware descriptor copy operation · 3daa64ee

farah kassabri authored Sep 22, 2022

This is needed to allow adding more data to the lkd_fw_comms_desc
structure.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: add razwi notify event · 413bdb17

Dani Liberman authored Oct 30, 2022

Each time a razwi (read-only zero, write ignored) event happens, besides
capturing its data, also notify the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: implement fp32 not supported event · 91bd8224

Ofir Bitton authored Oct 30, 2022

Due to binning, Gaudi2 does not always support fp32.
We add support for such an event in case fp32 is used by the user
in such a device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: add page fault notify event · aff6354a

Dani Liberman authored Oct 31, 2022

Each time a page fault happens, besides capturing its data, also notify
the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: use single threaded WQ for event handling · cd21701c

Dani Liberman authored Oct 27, 2022

Creating the event queue workqueue with alloc_workqueue made it run in
multi-threaded mode, which caused events to be dumped and reported to
the user in parallel, so logs with multiple events came out of order.

Fixed by creating the event queue workqueue as a single-threaded
workqueue.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
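
The ordering argument can be sketched in userspace: when a single consumer drains a FIFO, events are handled strictly in arrival order. The queue below is a hypothetical illustration with invented names; it does not use or model the kernel workqueue API:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QUEUE_DEPTH 8	/* power of two, hypothetical depth */

struct demo_event_queue {
	int events[DEMO_QUEUE_DEPTH];
	size_t head;	/* next slot to handle */
	size_t tail;	/* next free slot */
};

/* Queue an event id; returns 0 on success, -1 if the queue is full. */
static int demo_queue_event(struct demo_event_queue *q, int event_id)
{
	if (q->tail - q->head == DEMO_QUEUE_DEPTH)
		return -1;

	q->events[q->tail++ % DEMO_QUEUE_DEPTH] = event_id;
	return 0;
}

/* The single worker: handles events one at a time, in arrival order,
 * recording each handled id in out[]. Returns the number handled. */
static size_t demo_drain(struct demo_event_queue *q, int *out, size_t out_len)
{
	size_t n = 0;

	while (q->head != q->tail && n < out_len)
		out[n++] = q->events[q->head++ % DEMO_QUEUE_DEPTH];

	return n;
}
```

With several workers pulling from the same queue, the order in which handlers finish (and therefore the order of their log lines) would no longer be tied to arrival order.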

habanalabs/gaudi: add razwi notify event · cb5fb665

Dani Liberman authored Oct 30, 2022

Each time a razwi (read-only zero, write ignored) event happens, besides
capturing its data, also notify the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: add PCI revision 2 support · 841cd2d7

Ofir Bitton authored Oct 26, 2022

Add support for Gaudi2 Device with PCI revision 2.
Functionality is exactly the same as revision 1; the only difference
is the device name exposed to the user.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: remove redundant gaudi2_sec asic type · 30620698

Ofir Bitton authored Oct 26, 2022

As Gaudi2 has a single PCI id, the secured asic type is redundant.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add warning print upon a PCI error · bdfef91e

Ofir Bitton authored Oct 19, 2022

In order to know if the driver catches PCI errors correctly, we need to
print a warning for each error.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: fix PCIe access to SRAM via debugfs · fc69aa86

Tomer Tayar authored Oct 24, 2022

hl_access_sram_dram_region() uses a region base which is set within the
hl_set_dram_bar() function. However, for SRAM access this function is
not called, and we end up with a wrong value of region base and with a
bad calculated address.
Fix it by initializing the region base value independently of whether
hl_set_dram_bar() is called or not.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
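
The shape of this fix can be sketched as follows. The structure, the helper names and the way the BAR base is computed are assumptions for illustration only; the driver's real `hl_access_sram_dram_region()` and `hl_set_dram_bar()` are not reproduced here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical region descriptor; not the driver's real structure. */
struct demo_pci_region {
	uint64_t region_base;	/* device address the BAR currently maps */
	uint64_t bar_size;	/* BAR window size, assumed a power of two */
};

/* Stand-in for moving the BAR window so that it covers addr. */
static uint64_t demo_set_dram_bar(struct demo_pci_region *r, uint64_t addr)
{
	r->region_base = addr & ~(r->bar_size - 1);
	return r->region_base;
}

/* Offset of addr inside the BAR window. */
static uint64_t demo_bar_offset(struct demo_pci_region *r, uint64_t addr,
				bool set_dram_bar)
{
	/* Initialized unconditionally, so the value is valid even when
	 * the BAR is not moved (the SRAM case in this sketch). */
	uint64_t bar_base = r->region_base;

	if (set_dram_bar)
		bar_base = demo_set_dram_bar(r, addr);

	assert(addr >= bar_base);
	return addr - bar_base;
}
```

If `bar_base` were assigned only inside the `if`, the SRAM path would compute the offset from an undefined value, which corresponds to the badly calculated address described above.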

habanalabs: zero ts registration buff when allocated · 679e9689

farah kassabri authored Sep 20, 2022

To avoid memory corruption in kernel memory while using timestamp
registration nodes, zero the kernel buff memory when it is allocated.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
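
A userspace analogue of zeroing on allocation, using `calloc()` instead of `malloc()`. The node layout and the function name are hypothetical and only illustrate why stale bytes in a freshly allocated buffer can be mistaken for valid state:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical timestamp registration node; not the driver's layout. */
struct demo_ts_node {
	uint64_t timestamp;
	int in_use;	/* non-zero means the node is registered */
};

/* calloc() returns zero-filled memory, so every node starts out as
 * "not in use" instead of holding whatever bytes were there before. */
static struct demo_ts_node *demo_alloc_ts_nodes(size_t count)
{
	return calloc(count, sizeof(struct demo_ts_node));
}
```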

habanalabs: no consecutive err when user context is enabled · 4a9c6e2c

Tal Cohen authored Oct 18, 2022

The consecutive-error mechanism keeps h/w issues from triggering a
device reset loop by moving the device into an unavailable state.
When the user may be the cause of the error, an unavailable state
would prevent the user from running their workloads.

This commit prevents entering the consecutive-error state when a user
context is enabled.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
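
One way to picture the rule, as a sketch only: errors are counted toward the unavailable state unless a user context is enabled. The structure, the threshold of 3 and the function name are hypothetical, and the driver's actual bookkeeping may differ:

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_CONSECUTIVE_ERRORS 3	/* hypothetical threshold */

/* Hypothetical device state; not the driver's hl_device. */
struct demo_dev {
	bool user_ctx_enabled;
	unsigned int consecutive_errors;
	bool unavailable;
};

/* Account for an error that leads to a reset. */
static void demo_account_error(struct demo_dev *dev)
{
	/* The user may be the cause of the error, so do not count it
	 * toward making the device unavailable. */
	if (dev->user_ctx_enabled)
		return;

	if (++dev->consecutive_errors >= DEMO_MAX_CONSECUTIVE_ERRORS)
		dev->unavailable = true;
}
```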

habanalabs: use graceful hard reset for CS timeouts · 1b363adc

Tomer Tayar authored Sep 30, 2022

Use graceful hard reset when detecting a CS timeout that requires a
device reset.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: use graceful hard reset for F/W events · d1ce7e5e

Tomer Tayar authored Sep 30, 2022

Use graceful hard reset for F/W events on Gaudi2 device that require a
device reset.

While at it, do a small refactor of the checks and function calls,
to simplify it and to avoid code duplication.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: use graceful hard reset for F/W events · 5b8873b3

Tomer Tayar authored Sep 30, 2022

Use graceful hard reset for F/W events on Gaudi device that require a
device reset.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add an option to control watchdog timeout via debugfs · 11669b58

Tomer Tayar authored Sep 30, 2022

Add an option to control the timeout value for the driver's watchdog
of the reset process. The timeout represents the amount of time the
user has to close his process once he gets a device reset notification
from the driver.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add support for graceful hard reset · a88a6f5f

Tomer Tayar authored Sep 30, 2022

Calling hl_device_reset() for a hard reset will lead to an almost
immediate device reset and to killing the user process.
For resets that follow errors, this removes the option to debug the
errors on both the device side and the user application side.

This patch adds a 'graceful hard reset' option and a new
hl_device_cond_reset() function.
Under some conditions, mainly if there is no user process or if he is
not registered to driver notifications, this function will execute hard
reset as usual.
Otherwise, the reset will be postponed and a notification will be sent
to user, to let him perform post-error actions and then to release the
device, after which reset will take place.

If device is not released by user in some defined time, a watchdog work
will execute the reset in any case.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
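
The decision described above can be sketched as a small function. This is an illustration under stated assumptions, not the body of `hl_device_cond_reset()`: the state structure, the enum and the field names are invented, and only the two main conditions named in the message are modeled:

```c
#include <assert.h>
#include <stdbool.h>

enum demo_reset_action {
	DEMO_RESET_NOW,		/* hard reset executed as usual */
	DEMO_RESET_POSTPONED,	/* user notified, watchdog armed */
};

/* Hypothetical snapshot of the relevant device state. */
struct demo_dev_state {
	bool has_user_process;
	bool user_registered_for_notifications;
	bool notification_sent;
	bool watchdog_armed;
};

static enum demo_reset_action demo_cond_reset(struct demo_dev_state *s)
{
	/* Nobody to notify: reset immediately. */
	if (!s->has_user_process || !s->user_registered_for_notifications)
		return DEMO_RESET_NOW;

	/* Let the user perform post-error actions and release the
	 * device; the watchdog forces the reset if that takes too long. */
	s->notification_sent = true;
	s->watchdog_armed = true;
	return DEMO_RESET_POSTPONED;
}
```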

habanalabs: avoid divide by zero in device utilization · d1e0ac37

Ohad Sharabi authored Oct 23, 2022

Currently there is no verification whether the divisor is legal.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
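
A minimal sketch of guarding the divisor in a utilization calculation. The function name, the parameters and the choice to report 0 when the divisor is zero are assumptions for illustration; the commit does not state what the driver returns in that case:

```c
#include <assert.h>
#include <stdint.h>

/* Utilization in percent. busy_time and total_time are hypothetical
 * inputs in the same unit, assumed small enough that busy_time * 100
 * does not overflow 64 bits. */
static unsigned int demo_utilization_pct(uint64_t busy_time,
					 uint64_t total_time)
{
	/* Verify the divisor before using it. */
	if (total_time == 0)
		return 0;

	return (unsigned int)((busy_time * 100) / total_time);
}
```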

habanalabs: fix user mappings calculation in case of page fault · 6bcb2d05

Dani Liberman authored Oct 19, 2022

As there are 2 types of user mappings, pmmu and hmmu, calculate
only the relevant mappings for the requested type.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
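
The idea of counting only the mappings of the requested type can be sketched like this. The enum, the mapping structure and the function name are hypothetical and do not mirror the driver's data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_mmu_type {
	DEMO_MMU_PMMU,
	DEMO_MMU_HMMU,
};

/* Hypothetical user mapping record. */
struct demo_mapping {
	enum demo_mmu_type type;
	uint64_t dev_va;
	uint64_t size;
};

/* Count only the mappings that belong to the faulting MMU type, so a
 * buffer sized from the result holds exactly the relevant entries. */
static size_t demo_count_mappings(const struct demo_mapping *maps,
				  size_t n, enum demo_mmu_type type)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (maps[i].type == type)
			count++;

	return count;
}
```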

habanalabs/gaudi2: remove configurations to access the MSI-X doorbell · 5ad06bb1

Tomer Tayar authored Oct 20, 2022

The virtual MSI-X doorbell is supported now in F/W, so all
configurations to access the PCIE_DBI MSI-X doorbell can be removed.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: allow setting HBM BAR to other regions · e325d5db

Ohad Sharabi authored Sep 14, 2022

Up until now the use-case in the driver was that the HBM is accessed
using the HBM BAR, yet the BAR sometimes cannot cover the whole HBM and
so we needed to set the BAR to another HBM offset.
Now we are facing the need to access other PCI memory regions that can
be covered by the HBM BAR.
To answer that, we allow the caller to determine whether the HBM BAR
needs to be set, regardless of the PCI memory region.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: fix using freed pointer · 24fdfb35

Ohad Sharabi authored Oct 18, 2022

The code uses the pointer for tracing purposes (without actually
dereferencing it) but still gets a static analysis warning.
This patch eliminates the warning.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
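
One common way to avoid this class of warning, shown here as a userspace sketch, is to capture the value needed for tracing before the memory is freed. The function name and the trace slot are hypothetical; the commit does not say which approach the patch takes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Free buf and record its former address for tracing. The address is
 * saved as an integer before free(), so the pointer itself is never
 * touched after the memory is released. */
static void demo_free_and_trace(void *buf, uintptr_t *trace_slot)
{
	uintptr_t addr = (uintptr_t)buf;

	free(buf);
	*trace_slot = addr;
}
```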

habanalabs/gaudi2: unsecure CBU_EARLY_BRESP registers · dc8d243c

Dilip Puri authored Oct 12, 2022

NIC ARCs need to have access to CBU_EARLY_BRESP, hence we unsecure
those registers.
Signed-off-by: Dilip Puri <dilipp@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: verify no zero event is sent · 27cd39af

Tal Cohen authored Oct 03, 2022

The event notifier mechanism should not raise an empty
event (event equals zero).
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
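
A sketch of such a check, assuming events are accumulated as bits in a 64-bit mask. The function name, the mask representation and the use of `-EINVAL` are assumptions for illustration, not the driver's code:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Add event_mask to the pending events that will be reported to the
 * user. An empty mask carries no information, so it is rejected
 * instead of waking the user for nothing. */
static int demo_notify_event(uint64_t *pending_mask, uint64_t event_mask)
{
	if (event_mask == 0)
		return -EINVAL;

	*pending_mask |= event_mask;
	return 0;
}
```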

habanalabs/gaudi2: capture page fault data · 4f11694f

Dani Liberman authored Sep 29, 2022

Capture page fault data when it happens.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: capture RAZWI information · 15ac503c

Dani Liberman authored Sep 28, 2022

Added function to calculate possible engines which caused
RAZWI (read-only zero, write ignored), from a given router id or
module index.

When getting RAZWI via PSOC IP, first the router id is calculated
and then the possible engines that caused the RAZWI are calculated.

There is a possibility that the RAZWI initiator is not an engine. In
that case, it will not be included in possible engines as it
doesn't have an engine id.

RAZWI information is captured when receiving event from engine or via
PSOC IP.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: handle HBM MMU when capturing page fault data · 17f3f42a

Dani Liberman authored Sep 29, 2022

In case of HBM MMU page fault, capture its relevant mappings.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: move reset workqueue to be under hl_device · 1eebb259

Tomer Tayar authored Sep 30, 2022

'struct hl_device_reset_work' is used as a wrapper for the reset work
and its parameters, including the reset workqueue on which it runs.
In a future commit, another reset related work with similar parameters
is going to be added, but it won't use the reset workqueue.

As in any case there is a single reset workqueue, and to allow the reuse
of this structure, move the reset workqueue to 'struct hl_device'.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: allow unregistering eventfd when device non-operational · 51236cd9

Tomer Tayar authored Sep 30, 2022

Unregistering eventfd is for releasing host resources and doesn't
involve an access to the device. As such, there is no reason to disallow
it when device isn't operational.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: skip idle status check if reset on device release · 3a83ebc5

Tomer Tayar authored Sep 30, 2022

If reset upon device release is enabled, there is no need to check the
device idle status in hpriv_release(), because device is going to be
reset in any case.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: add device unavailable notification · 5731b6e6

Tal Cohen authored Sep 28, 2022

Device unavailable notifies the user that there isn't an option to
retrieve debug information from the device.
When a critical device error occurs and the f/w performs the device
reset, a device unavailable notification shall be sent to the user
process.
Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi2: remove privileged MME clock configuration · 16448d64

Koby Elbaz authored Sep 28, 2022

Privileged MME clock configuration is removed as it is done by the f/w.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
