One drive seems OK but still has hot spare in use, another shows fail but not using spare

wb-wpb

Cadet
Joined
May 3, 2016
Messages
2
This is an older 45Drives system with 45 x 4 TB drives that we reformatted and installed TrueNAS 13.0-U5.2 on. I know it has components that are not popular, like the HighPoint Rocket 750 card. I also did not follow the best practice of building multiple vdevs and joining them into a pool. Neither of those should change the underlying disk mechanics, though.

I had a drive go bad last week (da20), and a hot spare kicked in. I ordered a replacement drive and another for stock. While replacing the bad drive, I probably bumped its neighbor (da19) just enough that the other hot spare kicked in for that drive. After I finished replacing the first drive, its spare returned to available. The second drive (da19) shows no SMART issues in the short self-test log, and I don't see any other problems with it through ZFS, but the spare is still attached to it. The only error I received for da19 is this:
* Device: /dev/hptnr [hpt_disk_1/20/1], Read SMART Error Log Failed.

This morning I got a similar error on a third drive (da34), but no hot spare kicked in.
* Device: /dev/hptnr [hpt_disk_2/11/1], Self-Test Log error count increased from 0 to 1.

I know both drives should probably be replaced, but I would replace da34 before da19.
I am also wondering why the hot spare kicked in on da19 but not on da34. I read in another thread that ZFS is not SMART-aware and so does not activate spares based on SMART results, but the "failed to read SMART" message is the only error I saw for da19, and the error for da34 seems more definitive.
I read in another thread ( https://www.truenas.com/community/t...ad-drive-replacement-resilver-complete.88796/ ) that I should try zpool detach on the spare that is still in use. I tried the detach from the GUI on the in-use spare, but it errored out with "[EZFS_NOTSUP] Cannot detach root-level vdevs".

Results from smartctl -a on the 2 drives are:
Code:
  (da19)
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0027   230   175   021    Pre-fail  Always       -       5491
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       51
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   002   002   000    Old_age   Always       -       72182
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       51
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       50
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       31309
194 Temperature_Celsius     0x0022   127   119   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6641         -
# 2  Short offline       Completed without error       00%      6628         -
# 3  Short offline       Completed without error       00%      5242         -

Code:
 (da34) - hpt_disk_2/11/1
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       10
  3 Spin_Up_Time            0x0027   182   173   021    Pre-fail  Always       -       7866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       49
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       73195
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       49
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3343
194 Temperature_Celsius     0x0022   125   118   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       50%      7653         4333832
# 2  Short offline       Completed without error       00%      6262         -
# 3  Short offline       Completed without error       00%      6087         -
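
For anyone comparing the two self-test logs above, here is a minimal Python sketch that picks out the entries that did not pass. It assumes the row layout shown in the smartctl -a output in this post; the names ROW, failed_self_tests and DA34_LOG are made up for illustration and are not part of smartmontools or TrueNAS.

```python
import re

# One row of the "SMART Self-test log" table as printed in the post.
# Columns are separated by runs of two or more spaces.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<desc>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def failed_self_tests(log_text):
    """Return (test number, status, first-error LBA) for each test that did not pass."""
    failures = []
    for line in log_text.splitlines():
        m = ROW.match(line.strip())
        if m and m.group("status") != "Completed without error":
            failures.append((int(m.group("num")), m.group("status"), m.group("lba")))
    return failures


# The three self-test rows reported for da34 in the post.
DA34_LOG = """\
# 1  Short offline       Completed: read failure       50%      7653         4333832
# 2  Short offline       Completed without error       00%      6262         -
# 3  Short offline       Completed without error       00%      6087         -
"""

print(failed_self_tests(DA34_LOG))
# prints [(1, 'Completed: read failure', '4333832')]
```

Run against the da19 rows, the same function returns an empty list, which matches what the logs show: da34 has a recorded read failure at LBA 4333832, while da19 has no failed self-test.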
 

wb-wpb

I finally went into the shell and ran "zpool detach mypool sparedriveid", which did not error out and returned the spare to available, as it did in the other thread.
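
As a rough sketch of that step, the Python snippet below looks for spares marked INUSE in zpool status output and prints the matching detach command without running it. The status text is an invented sample, not output from this system: the raidz2 layout and the name otherspareid are assumptions, and mypool and sparedriveid are the placeholders used above. The function name in_use_spares is also made up for illustration.

```python
def in_use_spares(status_text):
    """Return device names listed as INUSE in the 'spares' section of zpool status output."""
    spares = []
    in_spares = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            in_spares = False  # a blank line ends the spares section
            continue
        if fields == ["spares"]:
            in_spares = True
            continue
        if in_spares and len(fields) >= 2 and fields[1] == "INUSE":
            spares.append(fields[0])
    return spares


# Invented sample for illustration only (layout and names are assumptions).
SAMPLE_STATUS = """\
config:

\tNAME                STATE     READ WRITE CKSUM
\tmypool              ONLINE       0     0     0
\t  raidz2-0          ONLINE       0     0     0
\t    spare-0         ONLINE       0     0     0
\t      da19          ONLINE       0     0     0
\t      sparedriveid  ONLINE       0     0     0
\t    da20            ONLINE       0     0     0
\tspares
\t  sparedriveid      INUSE     currently in use
\t  otherspareid      AVAIL

errors: No known data errors
"""

for name in in_use_spares(SAMPLE_STATUS):
    # Only prints the command; nothing is executed against the pool.
    print(f"zpool detach mypool {name}")
```

The point is that the detach targets the spare device itself, the one listed as INUSE, rather than the vdev it is attached under.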

The pool is showing healthy, but I will wait a while longer before replacing the one drive that is showing a SMART error.
 