- 27 Dec, 2021 3 commits
-
-
Greg Kroah-Hartman authored
Merge tag 'misc-habanalabs-next-2021-12-27' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into char-misc-next Oded writes: This tag contains habanalabs driver changes for v5.17: - Support reset-during-reset. In case the f/w notifies the driver that the f/w is going to reset the device, the driver should support that even if it is in the middle of doing another reset - Support events from f/w that arrive during device resets. These events would be ignored which is bad as critical errors would not be reported and treated by the driver. - Don't kill processes that hold the control device open during hard-reset of the device. The control device operations can't crash if done during hard-reset. And usually, only monitoring applications are using the control device, so killing them defies their purpose. - Fix handling of hwmon nodes when working with legacy f/w - Change the compute context pointer to be boolean. This pointer was abused by multiple code paths that wanted fast access to the compute context structure. - Add uapi to fetch historical errors. This is necessary as errors sometimes result in hard-reset where the user application is being terminated. - Optimize GAUDI's MMU cache invalidation. - Add support for loading the latest f/w. - Add uapi to fetch HBM replacement and pending rows information. - Multiple bug fixes to the reset code. - Multiple bug fixes for Multi-CS ioctl code. - Multiple bug fixes for wait-for-interrupt ioctl code. - Many small bug fixes and cleanups. * tag 'misc-habanalabs-next-2021-12-27' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux: (70 commits) habanalabs: support hard-reset scheduling during soft-reset habanalabs: add a lock to protect multiple reset variables habanalabs: refactor reset information variables habanalabs: handle skip multi-CS if handling not done habanalabs: add CPU-CP packet for engine core ASID cfg habanalabs: replace some -ENOTTY with -EINVAL habanalabs: fix comments according to kernel-doc habanalabs: fix endianness when reading cpld version habanalabs: change wait_for_interrupt implementation habanalabs: prevent wait if CS in multi-CS list completed habanalabs: modify cpu boot status error print habanalabs: clean MMU headers definitions habanalabs: expose soft reset sysfs nodes for inference ASIC habanalabs: sysfs support for two infineon versions habanalabs: keep control device alive during hard reset habanalabs: fix hwmon handling for legacy f/w habanalabs: add current PI value to cpu packets habanalabs: remove in_debug check in device open habanalabs: return correct clock throttling period habanalabs: wait again for multi-CS if no CS completed ...
-
Greg Kroah-Hartman authored
Merge tag 'extcon-next-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into char-misc-next Chanwoo writes: Update extcon next for v5.17 Detailed description for this pull request: 1. Remove duplicate code in extcon_set_state_sync() in extcon core 2. Fix non-kernel-doc comment for extcon-usb-gpio.c * tag 'extcon-next-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon: extcon: Deduplicate code in extcon_set_state_sync() extcon: usb-gpio: fix a non-kernel-doc comment
-
Greg Kroah-Hartman authored
Merge tag 'icc-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc into char-misc-next Georgi writes: interconnect changes for 5.17 Here are the interconnect changes for the 5.17-rc1 merge window consisting of new drivers, minor changes and fixes. New drivers: - New driver for MSM8996 platforms - New driver for SC7280 EPSS L3 hardware - New driver for QCM2290 platforms - New driver for SM8450 platforms Driver changes: - dt-bindings: interconnect: Combine SDM660 bindings into RPM schema - icc-rpm: Add support for bus power domain - icc-rpm: Use NOC_QOS_MODE_INVALID for qos_mode check - icc-rpm: Define ICC device type - icc-rpm: Add QNOC type QoS support - icc-rpm: Support child NoC device probe - icc-rpm: Prevent integer overflow in rate - icc-rpmh: Add BCMs to commit list in pre_aggregate Signed-off-by: Georgi Djakov <djakov@kernel.org> * tag 'icc-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc: interconnect: qcom: Add QCM2290 driver support dt-bindings: interconnect: Add Qualcomm QCM2290 NoC support interconnect: icc-rpm: Support child NoC device probe interconnect: icc-rpm: Add QNOC type QoS support interconnect: icc-rpm: Define ICC device type interconnect: qcom: Add SM8450 interconnect provider driver dt-bindings: interconnect: Add Qualcomm SM8450 DT bindings interconnect: qcom: rpm: Prevent integer overflow in rate interconnect: icc-rpm: Use NOC_QOS_MODE_INVALID for qos_mode check interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate interconnect: qcom: Add MSM8996 interconnect provider driver dt-bindings: interconnect: Add Qualcomm MSM8996 DT bindings interconnect: icc-rpm: Add support for bus power domain dt-bindings: interconnect: Combine SDM660 bindings into RPM schema interconnect: qcom: Add EPSS L3 support on SC7280 dt-bindings: interconnect: Add EPSS L3 DT binding on SC7280
-
- 26 Dec, 2021 37 commits
-
-
Ofir Bitton authored
As hard-reset can be requested during soft-reset, driver must allow it or else critical events received during soft-reset will be ignored. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Atomic operations during reset are replaced by a spinlock in order to have the ability to protect more than a single variable. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Unify variables related to device reset, which will help us to add some new reset functionality in future patches. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
This patch fixes issue in which we have timeout for multi-CS although the CS in the list actually completed. Example scenario (the two threads marked as WAIT for the thread that handles the wait_for_multi_cs and CMPL as the thread that signal completion for both CS and multi-CS): 1. Submit CS with sequence X 2. [WAIT]: call wait_for_multi_cs with single CS X 3. [CMPL]: CS X do invoke complete_all for both CS and multi-CS (multi_cs_completion_done still false) 4. [WAIT]: enter poll_fences, reinit the completion and find the CS as completed when asking on the fence but multi_cs_done is still false it returns that no CS actually completed 5. [CMPL]: set multi_cs_handling_done as true 6. [WAIT]: wait for completion but no CS to awake the wait context and hence wait till timeout Solution: if CS detected as completed in poll_fences but multi_cs_done is still false invoke complete_all to the multi-CS completion and so it will not go to sleep in wait_for_completion but rather will have a "second chance" to wait for multi_cs_completion_done. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
In some cases the driver cannot configure ASID of some engines due to the security level of the relevant registers. For this a new CPU-CP packet is introduced, which will allow the driver to ask the F/W to do this configuration instead. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
-ENOTTY is returned in case of error in the ioctl arguments themselves, such as function that doesn't exists. In all other cases, where the error is in the arguments of the custom data structures that we define that are passed in the various ioctls, we need to return -EINVAL. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Fix missing fields, descriptions not according to kernel-doc style. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Current sysfs implementation does not take endianness into consideration when dumping the cpld version. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
farah kassabri authored
Currently the cq counters are allocated in userspace memory, and mapped by the driver to the device address space. A new requirement that is part of new future API related to this one, requires that cq counters will be allocated in kernel memory. We leverage the existing cb_create API with KERNEL_MAPPED flag set to allocate this memory. That way we gain two things: 1. The memory cannot be freed while in use since it's protected by refcount in driver. 2. No need to wake up the user thread upon each interrupt from CQ, because the kernel has direct access to the counter. Therefore, it can make comparison with the target value in the interrupt handler and wake up the user thread only if the counter reaches the target value. This is instead of waking the thread up to copy counter value from user then go sleep again if target value wasn't reached. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
By the original design we assumed that if we "miss" multi CS completion it is of no severe consequence as we'll just call wait_for_multi_cs again. Sequence of events for such scenario: 1. user submit CS with sequence N 2. user calls wait for multi-CS with only CS #N in the list 3. the multi CS call starts with poll of the CSs but find that none completed (while CS #N did not completed yet) 4. now, multi CS #N complete but multi CS CTX was not yet created for the above multi-CS. so, attempt to complete multi-CS fails (as no multi CS CTX exist) 5. wait_for_multi_cs call now does init_wait_multi_cs_completion (and for this create the multi-CS CTX) 6. wait_for_multi_cs wits on completion but will not get one as CS #N already completed To fix the issue we initialize the multi-CS CTX prior polling the fences. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
As BTL can be replaced by ROM we should modify relevant error print. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
During the MMU development the MMU header files were left with unclean definitions: - MMU "version specific" definitions that were left in the mmu_general file - unused definitions This patch attempts, where possible, to keep definitions that can serve multiple MMU versions (but that are not tightly bound with specific MMU arch) in the mmu_general header file (e.g. different definitions for number of HOPs). Otherwise, move MMU version specific definitions (e.g. HOPs masks and shifts) to the specific MMU version file. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
As we allow soft-reset to be performed only on inference devices, having the sysfs nodes may cause a confusion. Hence, we remove those nodes on training ASICs. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Currently sysfs support dumping a single infineon version, in future asics we will have two infineon versions. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Need to allow user retrieve data during reset and afterwards without the need to reopen the device. Did it by seperating the user peocesses list into two lists: 1. fpriv_list which contains list of user processes that opened the device (currently only one). 2. fpriv_ctrl_list which contains list of user processes that opened the control device. This processes in this list shall not be killed during reset, only when the device is suddenly removed from PCI chain. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
In legacy f/w that use old hwmon.h file, the values of the hwmon enums are different than the values that are in newer kernels (5.6 and above). Therefore, to support working with those f/w, we need to do some fixup before registering with the hwmon subsystem and also when calling the functions that communicate with the f/w to retrieve sensors information. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
In order to increase cpucp messaging reliability we will add the current PI value to the descriptor sent to F/W. F/W will wait for the PI value as an indication of a valid packet. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The driver supports only a single user anyway, so there is no point in checking whether we are in_debug state when a user tries to open the device, because if we are in_debug, it means a user is already using the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Current clock throttling period returned from driver was wrong due to wrong time comparison. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
The original multi-CS design assumption that stream masters are used exclusively (i.e. multi-CS with set of stream master QIDs will not get completed by CS not from the multi-CS set) is inaccurate. Thus multi-CS behavior is now modified not to treat such case as an error. Instead, if we have multi-CS completion but we detect that no CS from the list is actually completed we will do another multi-CS wait (with modified timeout). Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
It was an error to save the compute context's pointer in the device structure, as it allowed its use without proper ref-cnt. Change the variable to a flag that only indicates whether there is an active compute context. Code that needs the pointer will now be forced to use proper internal APIs to get the pointer. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
There are multiple places where the code needs to get the context's pointer and increment its ref cnt. This is the proper way instead of using the compute context pointer in the device structure. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Pass the user's context pointer into the etr configuration function to extract its ASID. Using the compute_ctx pointer is an error as it is just an indication of whether a user has opened the compute device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Compute context pointer in hdev shouldn't be used for fetching the context's pointer. If an object needs the context's pointer, it should get it while incrementing its kref, and when the object is released, put it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The driver supports only a single context. Therefore, no need to check if the user context that is closed is the compute context. The user context, if exists, is always the compute context. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Fix a bug where in case of failure to allocate idr, the handle's memory wasn't freed as part of the error handling code. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Add missing kernel-doc comments for the "last_error" and "stream_master_qid_arr" fields of the "hl_device" structure". Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
The reset flags used by the reset thread are currently a mix of hard-coded values and a specific flag which is passed from the context that initiates the reset. To make it easier to pass more flags in future from this context to the reset thread, modify it to pass all the original reset flags to the thread. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Because info ioctl is used to retrieve data, some of its opcodes may be used during hard reset. Other ioctls should be blocked while device is not operational. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
For debug purpose, add SOB address and SOB initial counter value before current submission to uAPI output. Using SOB address and initial counter, user can calculate how much of the submmision has been completed. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
Reporting FW errors involves reading of the error registers. In case we have a corrupted FW descriptor we cannot do that since the dynamic scratchpad is potentially corrupted as well and may cause kernel crush when attempting access to a corrupted register offset. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Driver should handle events during soft-reset as F/W is not going through reset and it keeps sending events towards host. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Currently we dump the physical IRQ line index in host if an event is received during reset. This ID is confusing as it means nothing to the user. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
In new f/w versions, it is required to explicitly indicate the power information type when querying the F/W for power info. When getting the current power level it should be set to power_input. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Some info ioctls can be served even if the device is disabled or in reset. Hence, we enable more info ioctls during reset, as these ioctls do not require any H/W nor F/W communication. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Race example scenario: 1. User have 2 threads that waits on multi CS: - thread_0 waits on QID 0 and uses multi CS context 0. - thread_1 waits on QID 1 and uses multi CS context 1. 2. thread_1 got completion and release multi CS context 1. 3. CS related to multi CS of thread_0 starts executing complete_multi_cs function, the first iteration of the loop completes the multi CS of thread_0, hence multi CS context 0 is released. 4. thread_1 waits on QID 1 and uses multi CS context 0. 5. thread_0 waits on QID 0 and uses multi CS context 1. 6. The second iterattion of the loop (from step 3) starts, which means, start checking multi CS context 1: - multi CS contetxt is being used by thread_0 waiting on QID 0. - The fence of the CS (still CS from step 3) has QID map the same as the multi CS context 1. - multi CS context 1 (thread_0) gets completion on CS that triggered already thread_0 (with multi CS context 0) and is no longer being waited on. Fixed by exiting the loop in complete_multi_cs after getting completion Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
As device boot warnings clears the indication from the error mask, they must be located together before the unknown error validation. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-