- 18 Oct, 2021 40 commits
-
-
Omer Shpigelman authored
No need to check the return value if the following action is the same for both cases. In addition, now that hl_ctx_free() doesn't print if the context is not released, its name can be misleading as the context might stay alive after it is executed with no indication for that. Hence we can discard it and simply put the refcount. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Yuri Nudelman authored
Remove the flag that determines whether to take a timestamp once the interrupt arrives. Instead, always take the timestamp once per interrupt. This is a must for the user-space to measure its graph operations to evaluate the graph computation time. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Moti Haimovski authored
When adding a new node to the hpriv list, the driver should initialize its fields before adding the new node. Otherwise, there may be some small chance of another thread traversing that list and accessing the new node's fields without them being initialized. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Rajaravi Krishna Katta authored
Make the frequency set/get functionality common to all ASICs. This makes more code reusable when adding support for newer ASICs. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Vegard Nossum authored
Fix the following build/link error by adding a dependency on the CRC32 routines: ld: drivers/misc/habanalabs/common/firmware_if.o: in function `hl_fw_dynamic_request_descriptor': firmware_if.c:(.text.unlikely+0xc89): undefined reference to `crc32_le' Fixes: 8a43c83f ("habanalabs: load boot fit to device") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Tomer Tayar authored
Implement the calls to the dma-buf kernel api to create a dma-buf object backed by FD. We block the option to mmap the DMA-BUF object because we don't support DIRECT_IO and implicit P2P. We only implement support for explicit P2P through importing the FD of the DMA-BUF. In the export phase, we provide to the DMA-BUF object an array of pages that represent the device's memory area. During the map callback, we convert the array of pages into an SGT. We split/merge the pages according to the dma max segment size of the importer. To get the DMA address of the PCI bar, we use the dma_map_resources() kernel API, because our device memory is not backed by page struct and this API doesn't need page struct to map the physical address to a DMA address. We set the orig_nents member of the SGT to be 0, to indicate to other drivers that we don't support CPU mappings. Note that in Habanalabs's ASICs, the device memory is pinned and immutable. Therefore, there is no need for dynamic mappings and pinning callbacks. Also note that in GAUDI we don't have an MMU towards the device memory and the user works on physical addresses. Therefore, the user doesn't pass through the kernel driver to allocate memory there. As a result, only for GAUDI we receive from the user a device memory physical address (instead of a handle) and a size. We check the p2p distance using pci_p2pdma_distance_many() and refusing to map dmabuf in case the distance doesn't allow p2p. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
User process might want to share the device memory with another driver/device, and to allow it to access it over PCIe (P2P). To enable this, we utilize the dma-buf mechanism and add a dma-buf exporter support, so the other driver can import the device memory and access it. The device memory is allocated using our existing allocation uAPI, where the user will get a handle that represents the allocation. The user will then need to call the new uAPI (HL_MEM_OP_EXPORT_DMABUF_FD) and give the handle as a parameter. The driver will return a FD that represents the DMA-BUF object that was created to match that allocation. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
-
Dani Liberman authored
When polling fences for multi CS, it is possible that fence is no longer exists (its corresponding CS completed and the fence was deleted) but we still accessing its parameters, causing NULL pointer dereference. Fixed by checking if fence exits before accessing its parameters. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Race condition occurs when CS fence completes and multi CS did not completed yet, while waiting for multi CS ends and returns indication to user that the CS completed. Next wait for multi CS may be triggered by previous multi CS completion without any current CS completed, causing an error. Example scenario : 1. User do multi CS wait for CSs 1 and 2 on master QID 0 2. CS 1 and 2 reached the "cs release" code. The thread of CS 1 completed both the CS and multi CS handling but the completion thread of CS 2 completed the CS but still did not executed complete_multi_cs (note that in CS completion the sequence is to first do complete all for the CS and then another complete all to signal the multi_cs) 3. User received indication that CS 1 and 2 completed (since we check the CS fence and both indicated as completed) and immediately waits on CS 3 and 4, also on master QID 0. 4. Completion thread of CS2 executed complete_multi_cs before completion of CS 3 and 4 and so will trigger the multi CS wait of CSs 3 and 4 as they wait on master QID 0. This will trigger multi CS completion although none of its current CS has been completed. Fixed by adding multi CS complete handling indication for each CS. CS will be marked to the user as completed only if its fence completed and multi CS handling is done. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
In the kernel it is common to use u32 and not uint32_t. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Update the firmware headers to the latest version Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Bharat Jauhari authored
There may be a situation where drivers receives continuous fatal H/W error events from FW immediately post reset cycle. This may be due to some fault on the silicon itself. In such case its better to bypass reset cycle so we won't be stuck in endless loop of resets. This commit bypasses reset request in case driver received two back to back FW fatal error before first occurrence of heartbeat event. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Yuri Nudelman authored
Taking an accurate timestamp in a close proximity of the interrupt is required for user side statistics management. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The driver allows only a single process to open a device's FD at any single time. This is done by checking "hdev->compute_ctx" under mutex. Therefore, to prevent a race between the moment a user closes it's FD and when another user tries to open the device, we need to make sure that clearing this variable is the very last thing that is done in the code of the FD's release. I'm moving the idle check before clearing this variable and the "reset on device release". btw, if the reset happens it will prevent any other user from opening the device until the reset is finished. An important thing to note is that we need to remove the user process that is closing the device from the process list BEFORE calling the reset function. That is to prevent a case where the reset code will try to kill that user process and it is unnecessary as the process doesn't hold any device/driver resources anymore. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Reset to the device is not necessarily due to an error, so print it as info instead of error. In addition, print the type of reset we are doing: - reset of the entire device (aka hard reset) - reset of the device after user have released it (less than hard reset) - lighter reset of an inference device (aka soft reset) Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Soft-reset is the procedure where we reset only the compute/DMA engines of the device, without requiring the current user-space process to release the device. This type of reset can happen if TDR event occurred (a workload got stuck) or by a root request through sysfs. This is only relevant for inference ASICs, as there is no real-world use-case to do that in training, because training runs on multiple devices. In addition, we also do (in certain ASICs) a reset upon device release. That reset uses the same code as the soft-reset. Therefore, to better differentiate between the two resets, it is better to rename the soft-reset support as "inference soft-reset", to make the code more self-explanatory. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Yuri Nudelman authored
The translation in debugfs of device memory MMU virtual addresses was wrong as it did not take into consideration the fact that the page sizes there can be not power of 2. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
In order to avoid user target value wraparound, we modify the current interface so user will be able to wait for an 8-byte target value rather than a 4-byte value. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
During TDR handling, we check multiple times if CS is valid. No need to perform this check as CS must be valid at all time during the TDR handling. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Rajaravi Krishna Katta authored
Add support to retrieve following power info via HWMON: - instantaneous power value - highest value since last reset - reset the highest place holder Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Alon Mizrahi authored
Instead of having dedicated function per message that we want to send to the firmware in COMMS protocol, have a generic function that we can call to from other parts of the driver Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Rajaravi Krishna Katta authored
Instead of using the Linux kernel HWMON enums definition when communicating with the firmware, use proprietary HWMON based enums i.e. map hwmon.h header enum to cpucp_if.h based enum while. This is needed because the HWMON enums are not forcing backward compatibility and therefore changes can break compatibility between newer driver and older firmware. The driver will check for CPU_BOOT_DEV_STS0_MAP_HWMON_EN bit to validate if f/w supports cpucp->hwmon enum mapping to support older firmware where this mapping won't be available. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
Command submission timeout is currently determined during driver loading time. As some environments requires this timeout to be modified in runtime, we introduce a new debugfs node that controls the timeout value without the need to reload the driver. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Modify some comments in the uapi file to be in kernel-doc style. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-
Greg Kroah-Hartman authored
We need the char/misc fixes in here for merging and testing. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libataLinus Torvalds authored
Pull libata fixes from Damien Le Moal: "Two fixes for this cycle: - Fix a null pointer dereference in ahci-platform driver (from Hai) - Fix uninitialized variables in pata_legacy driver (from Dan)" * tag 'libata-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: ahci_platform: fix null-ptr-deref in ahci_platform_enable_regulators() pata_legacy: fix a couple uninitialized variable bugs
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull block fixes from Jens Axboe: "Bigger than usual for this point in time, the majority is fixing some issues around BDI lifetimes with the move from the request_queue to the disk in this release. In detail: - Series on draining fs IO for del_gendisk() (Christoph) - NVMe pull request via Christoph: - fix the abort command id (Keith Busch) - nvme: fix per-namespace chardev deletion (Adam Manzanares) - brd locking scope fix (Tetsuo) - BFQ fix (Paolo)" * tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block: block, bfq: reset last_bfqq_created on group change block: warn when putting the final reference on a registered disk brd: reduce the brd_devices_mutex scope kyber: avoid q->disk dereferences in trace points block: keep q_usage_counter in atomic mode after del_gendisk block: drain file system I/O on del_gendisk block: split bio_queue_enter from blk_queue_enter block: factor out a blk_try_enter_queue helper block: call submit_bio_checks under q_usage_counter nvme: fix per-namespace chardev deletion block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs nvme-pci: Fix abort command id
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull io_uring fix from Jens Axboe: "Just a single fix for a wrong condition for grabbing a lock, a regression in this merge window" * tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block: io_uring: fix wrong condition to grab uring lock
-
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds authored
Pull virtio fixes from Michael Tsirkin: "Fixes up some issues in rc5" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost-vdpa: Fix the wrong input in config_cb VDUSE: fix documentation underline warning Revert "virtio-blk: Add validation for block size in config space" vhost_vdpa: unset vq irq before freeing irq virtio: write back F_VERSION_1 before validate
-
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds authored
Pull powerpc fixes from Michael Ellerman: - Fix a bug where guests on P9 with interrupts passed through could get stuck in synchronize_irq(). - Fix a bug in KVM on P8 where secondary threads entering a guest would write outside their allocated stack. - Fix a bug in KVM on P8 where secondary threads could confuse the host offline code and cause the guest or host to crash. Thanks to Cédric Le Goater. * tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest() powerpc/xive: Discard disabled interrupts in get_irqchip_state()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull objtool fixes from Borislav Petkov: - Update section headers before the respective relocations to not trigger a safety check in elftoolchain's implementation of libelf - Do not add garbage data to the .rela.orc_unwind_ip section * tag 'objtool_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Update section header before relocations objtool: Check for gelf_update_rel[a] failures
-
git://git.kernel.org/pub/scm/linux/kernel/git/ras/rasLinus Torvalds authored
Pull EDAC fix from Borislav Petkov: - Log the "correct" uncorrectable error count in the armada_xp driver * tag 'edac_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/armada-xp: Fix output of uncorrectable error counter
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull perf fix from Borislav Petkov: - Add Sapphire Rapids to the list of CPUs supporting the SMI count MSR * tag 'perf_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/msr: Add Sapphire Rapids CPU support
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull EFI fixes from Borislav Petkov: "Forwarded from Ard Biesheuvel through the tip tree. Ard will send stuff directly in the near future. Low priority fixes but fixes nonetheless: - update stub diagnostic print that is no longer accurate - avoid statically allocated buffer for CPER error record decoding - avoid sleeping on the efi_runtime semaphore when calling the ResetSystem EFI runtime service" * tag 'efi-urgent-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: Change down_interruptible() in virt_efi_reset_system() to down_trylock() efi/cper: use stack buffer for error record decoding efi/libstub: Simplify "Exiting bootservices" message
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull x86 fixes from Borislav Petkov: - Do not enable AMD memory encryption in Kconfig by default due to shortcomings of some platforms, leading to boot failures. - Mask out invalid bits in the MXCSR for 32-bit kernels again because Thomas and I don't know how to mask out bits properly. Third time's the charm. * tag 'x86_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Mask out the invalid MXCSR bits properly x86/Kconfig: Do not enable AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT automatically
-
Linus Torvalds authored
Merge tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some small driver core fixes for 5.15-rc6, all of which have been in linux-next for a while with no reported issues. They include: - kernfs negative dentry bugfix - simple pm bus fixes to resolve reported issues" * tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: drivers: bus: Delete CONFIG_SIMPLE_PM_BUS drivers: bus: simple-pm-bus: Add support for probing simple bus only devices driver core: Reject pointless SYNC_STATE_ONLY device links kernfs: don't create a negative dentry if inactive node exists
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-miscLinus Torvalds authored
Pull char/misc driver fixes from Greg KH: "Here are some small char/misc driver fixes for 5.15-rc6 for reported issues that include: - habanalabs driver fixes - mei driver fixes and new ids - fpga new device ids - MAINTAINER file updates for fpga subsystem - spi module id table additions and fixes - fastrpc locking fixes - nvmem driver fix All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: eeprom: 93xx46: fix MODULE_DEVICE_TABLE nvmem: Fix shift-out-of-bound (UBSAN) with byte size cells mei: hbm: drop hbm responses on early shutdown mei: me: add Ice Lake-N device id. eeprom: 93xx46: Add SPI device ID table eeprom: at25: Add SPI ID table misc: HI6421V600_IRQ should depend on HAS_IOMEM misc: fastrpc: Add missing lock before accessing find_vma() cb710: avoid NULL pointer subtraction misc: gehc: Add SPI ID table MAINTAINERS: Drop outdated FPGA Manager website MAINTAINERS: Add Hao and Yilun as maintainers habanalabs: fix resetting args in wait for CS IOCTL fpga: ice40-spi: Add SPI device ID table
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/stagingLinus Torvalds authored
Pull staging and IIO driver fixes from Greg KH: "Here are a number of small IIO and staging driver fixes for 5.15-rc6. They include: - vc04_services bugfix for reported problem - r8188eu array underflow fix - iio driver fixes for a lot of tiny reported issues. All of these have been in linux-next for a while with no reported issues" * tag 'staging-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: r8188eu: prevent array underflow in rtw_hal_update_ra_mask() staging: vc04_services: shut up out-of-range warning iio: light: opt3001: Fixed timeout error when 0 lux iio: adis16480: fix devices that do not support sleep mode iio: mtk-auxadc: fix case IIO_CHAN_INFO_PROCESSED iio: adis16475: fix deadlock on frequency set iio: ssp_sensors: add more range checking in ssp_parse_dataframe() iio: ssp_sensors: fix error code in ssp_print_mcu_debug() iio: adc: ad7793: Fix IRQ flag iio: adc: ad7780: Fix IRQ flag iio: adc: ad7192: Add IRQ flag iio: adc: aspeed: set driver data when adc probe. iio: adc: rzg2l_adc: add missing clk_disable_unprepare() in rzg2l_adc_pm_runtime_resume() iio: adc: max1027: Fix the number of max1X31 channels iio: adc: max1027: Fix wrong shift with 12-bit devices iio: adc128s052: Fix the error handling path of 'adc128_probe()' iio: adc: rzg2l_adc: Fix -EBUSY timeout error return iio: accel: fxls8962af: return IRQ_HANDLED when fifo is flushed iio: dac: ti-dac5571: fix an error code in probe()
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/ttyLinus Torvalds authored
Pull serial driver fix from Greg KH: "Here is a single 8250 Kconfig fix for 5.15-rc6 that resolves a regression that showed up in 5.15-rc1. It has been in linux-next for a while with no reported issues" * tag 'tty-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: serial: 8250: allow disabling of Freescale 16550 compile test
-