drm/amdgpu: update eeprom once specifying one bigger threshold(v3)

During driver's probe, when it hits bad gpu tag in eeprom i2c init calling(the tag was set when reported bad page reaches bad page threshold in last driver's working loop), there are some strategys to deal with the cases: 1. when the module parameter amdgpu_bad_page_threshold = 0, that means page retirement feature is disabled, so just resetting the eeprom is fine. 2. When amdgpu_bad_page_threshold is not 0, and moreover, user sets one bigger valid data in order to make current boot up succeeds, correct eeprom header tag and do not break booting. 3. For other cases, driver's probe will be broken. v2: Just update eeprom header tag instead of resetting the whole table header when user sets one bigger threshold data. v3: Use dev_info/dev_err to print PCI device information, which helps in mGPU case. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: update eeprom once specifying one bigger threshold(v3)
During driver's probe, when it hits bad gpu tag in eeprom i2c init calling(the tag was set when reported bad page reaches bad page threshold in last driver's working loop), there are some strategys to deal with the cases: 1. when the module parameter amdgpu_bad_page_threshold = 0, that means page retirement feature is disabled, so just resetting the eeprom is fine. 2. When amdgpu_bad_page_threshold is not 0, and moreover, user sets one bigger valid data in order to make current boot up succeeds, correct eeprom header tag and do not break booting. 3. For other cases, driver's probe will be broken. v2: Just update eeprom header tag instead of resetting the whole table header when user sets one bigger threshold data. v3: Use dev_info/dev_err to print PCI device information, which helps in mGPU case. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
9b856def · Guchun Chen · Alex Deucher · a219ecbb · 9b856def
Commit 9b856def authored Jul 27, 2020 by Guchun Chen Committed by Alex Deucher Aug 04, 2020
Hide whitespace changes
Inline Side-by-side

Showing with 28 additions and 2 deletions

drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c +28 -2

No files found.
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -216,6 +216,24 @@ static bool __validate_tbl_checksum(struct amdgpu_ras_eeprom_control *control,
 	return true;
 }

+static int amdgpu_ras_eeprom_correct_header_tag(
+				struct amdgpu_ras_eeprom_control *control,
+				uint32_t header)
+{
+	unsigned char buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE];
+	struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
+	int ret = 0;
+
+	memset(buff, 0, EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE);
+
+	mutex_lock(&control->tbl_mutex);
+	hdr->header = header;
+	ret = __update_table_header(control, buff);
+	mutex_unlock(&control->tbl_mutex);
+
+	return ret;
+}
+
 int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control)
 {
 	unsigned char buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 };
@@ -248,6 +266,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 	struct amdgpu_device *adev = to_amdgpu_device(control);
 	unsigned char buff[EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE] = { 0 };
 	struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
+	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 	struct i2c_msg msg = {
 			.addr	= 0,
 			.flags	= I2C_M_RD,
@@ -287,9 +306,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,

 	} else if ((hdr->header == EEPROM_TABLE_HDR_BAD) &&
 			(amdgpu_bad_page_threshold != 0)) {
-		*exceed_err_limit = true;
-		dev_err(adev->dev, "Exceeding the bad_page_threshold parameter, "
+		if (ras->bad_page_cnt_threshold > control->num_recs) {
+			dev_info(adev->dev, "Using one valid bigger bad page "
+				"threshold and correcting eeprom header tag.\n");
+			ret = amdgpu_ras_eeprom_correct_header_tag(control,
+							EEPROM_TABLE_HDR_VAL);
+		} else {
+			*exceed_err_limit = true;
+			dev_err(adev->dev, "Exceeding the bad_page_threshold parameter, "
 				"disabling the GPU.\n");
+		}
 	} else {
 		DRM_INFO("Creating new EEPROM table");