SOLVED Pool errors - read, write, checksum - after update from 11.2-u1 to 11.2-u2.1

Kirill_v_b

Cadet
Joined
Apr 22, 2017
Messages
4
Hello,

After updating from 11.2-U1 to 11.2-U2.1, my pool (RAIDZ3, 8 HDDs: 7 + 1 spare, on a Supermicro LSI2308-IT controller) starts resilvering. zpool status then shows a lot of errors (read, write, checksum) on ALL pool devices. smartctl records errors like this for ALL devices:

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 98 98 cb 40 40 08 00:52:23.536 WRITE FPDMA QUEUED
61 10 90 08 61 89 40 08 00:52:23.534 WRITE FPDMA QUEUED
61 10 88 f8 85 d2 40 08 00:52:23.534 WRITE FPDMA QUEUED
61 10 80 10 ee a0 40 08 00:52:22.898 WRITE FPDMA QUEUED
61 10 78 10 ec a0 40 08 00:52:22.896 WRITE FPDMA QUEUED

Error 94 occurred at disk power-on lifetime: 17584 hours (732 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 a8 10 6e 89 40 08 00:52:22.322 WRITE FPDMA QUEUED
ef 02 00 00 00 00 40 08 00:52:22.321 SET FEATURES [Enable write cache]
ef aa 00 00 00 00 40 08 00:52:22.319 SET FEATURES [Enable read look-ahead]
c6 00 10 00 00 00 40 08 00:52:22.319 SET MULTIPLE MODE
ef 10 02 00 00 00 40 08 00:52:22.318 SET FEATURES [Enable SATA feature]

After about 15 minutes the system hangs completely, and only a power reset brings the server back.

Rebooting into the previous version (11.2-U1 or 11.2-RELEASE) starts the resilver again. During that resilver there are no errors at all, and no new errors appear in smartctl. Once the resilver finishes, the pool returns to a normal state and works without any errors.

Updating again from 11.2-U1 to 11.2-U2.1 starts the nightmare all over: resilvering, zpool status showing a lot of errors (read, write, checksum) on ALL pool devices, and so on.

Please help me find, understand, and solve the problem.

Thanks and sorry for the broken English!
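
For anyone comparing runs under different releases, here is a minimal Python sketch that lists which rows of a saved `zpool status` output have non-zero READ/WRITE/CKSUM counters. The function name `devices_with_errors` and the sample text (pool name `tank`, devices `da0` and so on) are made up for illustration and are not output from the system in this thread. The counters are compared as text, so abbreviated values are treated simply as "not zero".

```python
# Hypothetical sample, NOT taken from the poster's system.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz3-0  DEGRADED     0     0     0
            da0     ONLINE       3     5    12
            da1     ONLINE       0     0     0
            da2     DEGRADED     0     7     1  (resilvering)
        spares
          da7       AVAIL

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts right after this header row.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            # A blank line ends the device table.
            in_table = False
            continue
        if len(fields) < 5:
            # Rows such as "spares" or "da7 AVAIL" carry no counters.
            continue
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for name, counters in devices_with_errors(SAMPLE_STATUS).items():
        print(name, *counters)
```

Saving the output once per release (for example with `zpool status > status-u1.txt`) and running the function on each file makes it easy to see whether every device is affected or only those behind one controller.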
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
Are you confident in your power supply? I have seen ABRT errors with similarly random behaviour during boot in the past, and the cause turned out to be the power supply struggling while the disks spun up. Since I replaced the power supply, the problem has not recurred. Just food for thought.
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
I still have the errors from that time in the disk's error log (today the disk's Power_On_Hours value is 50452):
Code:
Error 40 occurred at disk power-on lifetime: 38804 hours (1616 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 40  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 40 00      00:09:38.493  SET FEATURES [Set transfer mode]
  ef 03 46 00 00 00 40 00      00:09:38.489  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 40 00      00:09:38.488  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      00:09:24.007  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 00 00      00:09:24.007  IDENTIFY DEVICE
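
The two logs in this thread carry different flags: `ICRC, ABRT` in the first post and `Device Fault; Error: ABRT` here. ICRC is the interface CRC flag, which usually points at the link between disk and controller rather than at the disk surface, so it can be useful to count the flags separately. A minimal Python sketch, assuming the log was saved as text from `smartctl -l error`; the function name `tally_error_flags` and the regular expression are illustrative only and cover just the line shapes shown in this thread.

```python
import re
from collections import Counter

# Matches the flag list after "Error:" up to " at LBA" or the end of the line.
ERROR_LINE = re.compile(r"Error:\s*([A-Z0-9, ]+?)(?:\s+at LBA|\s*$)")


def tally_error_flags(log_text):
    """Count each flag (ICRC, ABRT, ...) found on 'Error:' register lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if not match:
            continue
        for flag in match.group(1).split(","):
            flag = flag.strip()
            if flag:
                counts[flag] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0\n"
        "04 61 46 00 00 00 40  Device Fault; Error: ABRT\n"
    )
    print(dict(tally_error_flags(sample)))
```

A log dominated by ICRC on every disk at once fits a shared cause (controller, driver, cabling, backplane, power) better than eight disks failing together.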
 

Kirill_v_b

Cadet
Joined
Apr 22, 2017
Messages
4
SmallGuy said:
Are you confident in your power supply? I have seen ABRT errors with similarly random behaviour during boot in the past, and the cause turned out to be the power supply struggling while the disks spun up. Since I replaced the power supply, the problem has not recurred. Just food for thought.

I don't think the power supply is the problem. I have redundant (1+1) 500 W Platinum-level (94%) power supplies in a Supermicro CSE-836TQ-R500B case. The highest power consumption ever recorded was 324 W, and the Supermicro IPMI server health log shows no power supply errors.

I have two pools: one of 8 HDDs on the LSI2308-IT controller, and a second of another 8 HDDs on the Intel® C612 controller (the chipset of the Supermicro X10DRL-i motherboard). The second pool did not encounter any errors after the update to 11.2-U2.1, so the problem is probably with the LSI2308-IT controller under 11.2-U2.1.

Just out of curiosity, I rebooted into 11.2-U2.1 once more, and the problem returned immediately. So I rebooted again into 11.2-U3 and waited for the resilver and scrub to complete. Finally, I have a working pool without any errors.
 