- 23 Nov, 2022 38 commits
-
-
Ofir Bitton authored
Due to binning, Gaudi2 does not always support fp32. We add support for such an event in case fp32 is used by the user in such a device. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Each time page fault happens, besides capturing its data, also notify the user about it. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Creating event queue workqueue using alloc_workqueue made it run in multi threaded mode, which caused parallel dumping of events as well as parallel events notifying to user, causing logs with multiple events to be out of order. Fixed by creating event queue workqueue as single threaded work queue. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Each time razwi (read-only zero, write ignore) happens, besides capturing its data, also notify the user about it. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Add support for Gaudi2 Device with PCI revision 2. Functionality is exactly the same as revision 1, the only difference is device name exposed to user. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
As Gaudi2 has a single PCI id, the secured asic type is redundant. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
In order to know if driver catches PCI errors correctly, we need to print a warning per each error. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
hl_access_sram_dram_region() uses a region base which is set within the hl_set_dram_bar() function. However, for SRAM access this function is not called, and we end up with a wrong value of region base and with a bad calculated address. Fix it by initializing the region base value independently of whether hl_set_dram_bar() is called or not. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
farah kassabri authored
To avoid memory corruption in kernel memory while using timestamp registration nodes, zero the kernel buff memory when its allocated. Signed-off-by:
farah kassabri <fkassabri@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tal Cohen authored
Consecutive error protects a device reset loop from being triggered due to h/w issues and enters the device into an unavailable state. When user may cause the error, an unavailable state will prevent the user from running its workloads. The commit prevents entering consecutive state when a user context is enabled. Signed-off-by:
Tal Cohen <talcohen@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Use graceful hard reset when detecting a CS timeout that requires a device reset. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Use graceful hard reset for F/W events on Gaudi2 device that require a device reset. While at it, do a small refactor of the checks and function calls, to simplify it and to avoid code duplication. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Use graceful hard reset for F/W events on Gaudi device that require a device reset. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Add an option to control the timeout value for the driver's watchdog of the reset process. The timeout represents the amount of the user has to close his process once he gets a device reset notification from the driver. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Calling hl_device_reset() for a hard reset will lead to a quite immediate device reset and to killing user process. For resets that follow errors, it disables the option to debug the errors on both the device side and the user application side. This patch adds a 'graceful hard reset' option and a new hl_device_cond_reset() function. Under some conditions, mainly if there is no user process or if he is not registered to driver notifications, this function will execute hard reset as usual. Otherwise, the reset will be postponed and a notification will be sent to user, to let him perform post-error actions and then to release the device, after which reset will take place. If device is not released by user in some defined time, a watchdog work will execute the reset in any case. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
Currently there is no verification whether the divisor is legal. Signed-off-by:
Ohad Sharabi <osharabi@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
As there are 2 types of user mappings, pmmu and hmmu, calculate only the relevant mappings for the requested type. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
The virtual MSI-X doorbell is supported now in F/W, so all configurations to access the PCIE_DBI MSI-X doorbell can be removed. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
Up until now the use-case in the driver was that the HBM is accessed using the HBM BAR, yet the BAR sometimes cannot cover the whole HBM and so we needed to set the BAR to other HBM offset. Now we are facing the need to access other PCI memory regions that can be covered by the HBM BAR. To answer that we are allowing the caller to determine if the HBM BAR need to be set or not regardless of the PCI memory region. Signed-off-by:
Ohad Sharabi <osharabi@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
The code uses the pointer for trace purpose (without actually dereference it) but still get static analysis warning. This patch eliminate the warning. Signed-off-by:
Ohad Sharabi <osharabi@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dilip Puri authored
NIC ARCs need to have access to CBU_EARLY_BRESP, hence we unsecure those registers. Signed-off-by:
Dilip Puri <dilipp@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tal Cohen authored
The event notifier mechanism should not raise an empty event (event equals zero). Signed-off-by:
Tal Cohen <talcohen@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Capture page fault data when it happens. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Added function to calculate possible engines which caused RAZWI (read-only zero, write ignored), from a given router id or module index. When getting RAZWI via PSOC IP, first the router id is calculated and then the possible engines that caused the RAZWI are calculated. There is a possibility that the RAZWI initiator is not an engine. In that case, it will not be included in possible engines as it doesn't have an engine id. RAZWI information is captured when receiving event from engine or via PSOC IP. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
In case of HBM MMU page fault, capture its relevant mappings. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
'struct hl_device_reset_work' is used as a wrapper for the reset work and its parameters, including the reset workqueue on which it runs. In a future commit, another reset related work with similar parameters is going to be added, but it won't use the reset workqueue. As in any case there is a single reset workqueue, and to allow the resue of this structure, move the reset workqueue to 'struct hl_device'. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Unregistering eventfd is for releasing host resources and doesn't involve an access to the device. As such, there is no reason to disallow it when device isn't operational. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
If reset upon device release is enabled, there is no need to check the device idle status in hpriv_release(), because device is going to be reset in any case. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tal Cohen authored
Device unavailable notifies the user that there isn't an option to retrieve debug information from the device. When a critical device error occurs and the f/w performs the device reset, a device unavailable notification shall be sent to the user process. Signed-off-by:
Tal Cohen <talcohen@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Koby Elbaz authored
Privileged MME clock configuration is removed as it is done by the f/w. Signed-off-by:
Koby Elbaz <kelbaz@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dafna Hirschfeld authored
pf was an abbreviation for prefetch but because pf already stands for 'physical function', we decided to change it to 'prefetch'. Signed-off-by:
Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Only the first page fault will be saved. Besides the address which caused the page fault, the driver captures all of the mmu user mappings. User can retrieve this data via the new uapi (new opcode in INFO ioctl). Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
RAZWI is optionally handled as part of the generic QM SEI error handling, but it always uses PDMA as the module ID. Fix it to use the suitable module ID according to the specific event. Signed-off-by:
Tomer Tayar <ttayar@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Bharat Jauhari authored
This fixes sparse warning on doing cast to 32-bits Signed-off-by:
Bharat Jauhari <bjauhari@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
This event notification was compatible only with gaudi, where razwi and page fault happens together. To make it compatible with all ASICs, this refactor contains: 1. Razwi notification will only notify about razwi info. New notification will be added in future patch, to retrieve data about page fault error. 2. Changed razwi info structure to support all ASICs. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Use the simplified API that calculates distance between two devices. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Monitoring apps would like to query device state at any time so we should allow it also during reset because it doesn't involve accessing the h/w. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Yang Yingliang authored
If hl_cpu_accessible_dma_pool_alloc() fails, we should check 'req_cpu_addr', fix it. Fixes: 0c88760f ("habanalabs/gaudi2: add secured attestation info uapi") Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 21 Nov, 2022 2 commits
-
-
Greg Kroah-Hartman authored
We need the char/misc fixes in here as well. Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Linus Torvalds authored
-