Thanks in advance for any clarity!
During a scrub of my RAIDZ2 pool (rz2), disk da4 generated many errors and was faulted. This is the current state:
Code:
  pool: rz2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 14h13m with 0 errors on Sun Mar 15 14:13:11 2020
config:

        NAME                                                STATE     READ WRITE CKSUM
        rz2                                                 DEGRADED     0     0     0
          raidz2-0                                          DEGRADED     0     0     0
            gptid/17adb8b7-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0
            gptid/1873efe0-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0
            gptid/195e2a96-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0
            gptid/1a3e1bae-b58b-11e5-b4f6-000c298fbe2c.eli  FAULTED     45   134     0  too many errors
            gptid/1b098197-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0

errors: No known data errors
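As a rough illustration of how to read that table, the short Python sketch below pulls out every row with a non-zero READ, WRITE, or CKSUM count. The function name `devices_with_errors` is made up for this example, the sample text is the config section copied from the output above, and the sketch assumes plain integer counters (it does not handle abbreviated counts such as "1.2K").

```python
import re

# Config section copied from the zpool status output above.
STATUS = """\
NAME STATE READ WRITE CKSUM
rz2 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
gptid/17adb8b7-b58b-11e5-b4f6-000c298fbe2c.eli ONLINE 0 0 0
gptid/1873efe0-b58b-11e5-b4f6-000c298fbe2c.eli ONLINE 0 0 0
gptid/195e2a96-b58b-11e5-b4f6-000c298fbe2c.eli ONLINE 0 0 0
gptid/1a3e1bae-b58b-11e5-b4f6-000c298fbe2c.eli FAULTED 45 134 0 too many errors
gptid/1b098197-b58b-11e5-b4f6-000c298fbe2c.eli ONLINE 0 0 0
"""

# name, state, then three integer counters; the header row does not match
# because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(text):
    """Return (name, state, read, write, cksum) for rows with any errors."""
    found = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            found.append((name, state, read, write, cksum))
    return found


for row in devices_with_errors(STATUS):
    print(row)
```

On the output above this reports only the one FAULTED member (45 read and 134 write errors, 0 checksum errors); the pool and raidz2-0 rows show zero counts.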
I ran a nine-hour SMART long test with smartctl and didn't see any errors:
Code:
>>>freenas: root@/mnt/rz2/data/gb$: smartctl -t long /dev/da4
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 545 minutes for test to complete.
Test will complete after Mon Mar 16 20:38:10 2020

*****

>>>freenas: root@/mnt/rz2/data/gb$: smartctl -a /dev/da4
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   171   170   021    Pre-fail  Always       -       8441
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       44
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       53682
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       44
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       42
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4375
194 Temperature_Celsius     0x0022   122   113   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged
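To make "didn't see any errors" a little more concrete, here is a small Python sketch that extracts the raw values of the attributes most often watched for media or link trouble. The choice of IDs (5, 197, 198, 199) is an assumption of this example rather than anything smartctl prescribes, and the helper name `watched_raw_values` is hypothetical. The rows are copied from the table above.

```python
# A few rows copied from the smartctl -a attribute table above.
SMART_ROWS = """\
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
9 Power_On_Hours 0x0032 027 027 000 Old_age Always - 53682
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

# Assumption: these IDs are the ones of interest for this illustration.
WATCHED_IDS = {5, 197, 198, 199}


def watched_raw_values(text):
    """Map attribute name -> raw value for the watched attribute IDs."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # A table row has ten columns and starts with the numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS:
            values[fields[1]] = int(fields[9])
    return values


for name, raw in watched_raw_values(SMART_ROWS).items():
    print(name, raw)
```

All four of these raw values are 0 in the output above, which is consistent with the drive itself not reporting reallocated, pending, or uncorrectable sectors, or interface CRC errors.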
In the system log there were many SCSI errors:
Code:
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 79 1f 4c 88 00 00 00 08 00 00 length 4096 SMID 780 terminated ioc 804b scsi 0 state c xfer 0
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b e8 00 00 50 00 length 40960 SMID 765 terminated ioc 804b scsi 0 state c xfer 0
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b 38 00 00 58 00 length 45056 SMID 556 terminated ioc 804b scsi 0 state c xfer 0
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b 38 00 00 58 00
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): CAM status: SCSI Status Error
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): SCSI status: Check Condition
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): Retrying command (per sense data)
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b e8 00 00 50 00
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): CAM status: SCSI Status Error
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): SCSI status: Check Condition
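For anyone wanting to read the CDB bytes in those log lines, the following Python sketch decodes the starting block address and transfer size of a READ(10) or READ(16) command using the standard SCSI field layout. The function name `decode_read_cdb` is made up for this example, and the 512-byte logical block size is an assumption (it happens to agree with the "length" values logged above).

```python
def decode_read_cdb(cdb_hex, block_size=512):
    """Decode a READ(10)/READ(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks, byte_count). block_size is an assumption.
    """
    cdb = bytes.fromhex(cdb_hex)
    opcode = cdb[0]
    if opcode == 0x28 and len(cdb) == 10:
        # READ(10): bytes 2-5 are the LBA, bytes 7-8 the transfer length.
        lba = int.from_bytes(cdb[2:6], "big")
        blocks = int.from_bytes(cdb[7:9], "big")
    elif opcode == 0x88 and len(cdb) == 16:
        # READ(16): bytes 2-9 are the LBA, bytes 10-13 the transfer length.
        lba = int.from_bytes(cdb[2:10], "big")
        blocks = int.from_bytes(cdb[10:14], "big")
    else:
        raise ValueError("not a READ(10) or READ(16) CDB")
    return lba, blocks, blocks * block_size


# CDBs copied from the log lines above.
for cdb in (
    "88 00 00 00 00 01 79 1f 4c 88 00 00 00 08 00 00",
    "28 00 c1 f8 3b e8 00 00 50 00",
    "28 00 c1 f8 3b 38 00 00 58 00",
):
    lba, blocks, nbytes = decode_read_cdb(cdb)
    print(hex(lba), blocks, nbytes)
```

The decoded byte counts come out as 4096, 40960, and 45056, matching the lengths in the log. The failed reads also land at widely separated block addresses (0x1791f4c88 versus 0xc1f83b38 and 0xc1f83be8) rather than in one small region of the disk.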
This leads me to think a cable may have become unseated or lost contact. I opened the case and carefully wiggled the cables attached to da4.
So that is where things stand. My question is what I should do next. I am considering a zpool clear command to clear the errors from the faulted disk, and I assume I would use the form of the command that references only da4. However, since the only errors showing for the pool are associated with the faulted disk, would it make any difference to issue zpool clear for the whole pool instead? Note also that this RAIDZ2 pool is encrypted, and I have two backups of the important data.
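To spell out the two forms being asked about, here is a dry-run Python sketch that only builds and prints the command lines and runs nothing. It assumes the usual `zpool clear pool [device]` syntax and that the device is named by the gptid label shown in the status output; the helper `zpool_clear_argv` is hypothetical.

```python
import shlex


def zpool_clear_argv(pool, device=None):
    """Build the argument list for 'zpool clear pool [device]' (dry run)."""
    argv = ["zpool", "clear", pool]
    if device is not None:
        argv.append(device)
    return argv


POOL = "rz2"
# Label of the FAULTED member, copied from the zpool status output.
FAULTED = "gptid/1a3e1bae-b58b-11e5-b4f6-000c298fbe2c.eli"

# Form 1: clear errors on the one device. Form 2: clear the whole pool.
for argv in (zpool_clear_argv(POOL, FAULTED), zpool_clear_argv(POOL)):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The sketch only shows what each command line would look like; which of the two is appropriate here is exactly the open question.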
Another thing I noticed is that the device name da4 always seems to be associated with the same disk. This forum says that the relationship can change. Perhaps it does if cables are swapped around, but on my system it has stayed the same across reboots.
I found information on replacing a disk in the manual, but nothing covering this situation.
Some in the forum have said that one must offline a disk first. If I just clear the errors, will it automatically rejoin the pool?
Fortunately the system has been very stable and this hasn't happened before, so I thought I would check with the forum's expertise before I make a mistake.
Thanks.