Alrighty, so here's where I'm at now...
I took all of my drives physically out of the system, then plugged them back in one at a time, rebooting after each one to see whether it showed up in the BIOS. One by one, they all showed up. Success! Sort of. (It may or may not be important to note that I'm pretty sure none of the drives ended up on its original SATA port on the motherboard when I plugged everything back in. I seem to remember reading a while back that this isn't an issue, though.)
Next, I let it boot into FreeNAS. Everything was going better than before when I started seeing messages along the lines of "doing scan sync txg ..." with some other numbers and things. Some searches led me to believe that may indicate resilvering. Sure enough, it started resilvering one of the drives last night (this computer has been packed up for over three months and I can't remember what its exact state was when we were getting things ready to move).
Anyway, ran some errands this morning (it was still resilvering). While I was out, received an e-mail for a critical alert "The volume spradlin state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected."
Just got home a bit ago, and this is what I'm seeing:
zpool status:
Code:
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:04:46 with 0 errors on Sat Jan 4 03:49:46 2020
config:

    NAME                                          STATE     READ WRITE CKSUM
    freenas-boot                                  ONLINE       0     0     0
      gptid/6da54db8-b264-11e4-8c0d-bc5ff4e7a55f  ONLINE       0     0     0

errors: No known data errors

  pool: spradlin
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 2.42T in 0 days 08:04:28 with 0 errors on Sat Jan 4 09:32:21 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    spradlin                                        ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/20958994-9461-11e9-be2d-bc5ff4e7a55f  ONLINE       0     0   108
        gptid/26159776-92c0-11e4-a186-bc5ff4e7a55f  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/137f3c38-dd06-11e5-990e-bc5ff4e7a55f  ONLINE       0     0     0
        gptid/842ac0d2-ae7d-11e8-b1c8-bc5ff4e7a55f  ONLINE       0     0     0
      mirror-2                                      ONLINE       0     0     0
        gptid/1a92ec0a-dd06-11e5-990e-bc5ff4e7a55f  ONLINE       0     0     0
        gptid/967e579f-b001-11e8-8f71-bc5ff4e7a55f  ONLINE       0     0     0

errors: No known data errors
The only thing I notice in this is a crap ton of checksum errors (108 on one disk in mirror-0), and FreeNAS is recommending zpool clear.
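For anyone who wants to pick the affected device out of that output without eyeballing it, here is a minimal Python sketch. It assumes the config rows have the five columns shown above (NAME, STATE, READ, WRITE, CKSUM) with plain integer counters; the function name `devices_with_errors` and the `SAMPLE` text are just illustration, not anything FreeNAS ships.

```python
import re

# A few rows copied from the `zpool status` output above.
# Indentation does not matter for this sketch.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
spradlin ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gptid/20958994-9461-11e9-be2d-bc5ff4e7a55f ONLINE 0 0 108
gptid/26159776-92c0-11e4-a186-bc5ff4e7a55f ONLINE 0 0 0
"""

# name, state, then three integer counters.
# Assumption: counters are plain digits; abbreviated values such as "1.2K"
# are not handled here.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line or anything that is not a config row
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


print(devices_with_errors(SAMPLE))
```

On the rows above this flags only the first disk of mirror-0, with 108 checksum errors and no read or write errors.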
Here's the smartctl output, formatted for reading:
Code:
########## SMART status report summary for all drives ##########
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|Device|Serial         |Temp|Power|Start|Spin |ReAlloc|Current|Offline |UDMA  |Seek  |High  |Command|Last|
|      |               |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|CRC   |Errors|Fly   |Timeout|Test|
|      |               |    |Hours|Count|Count|       |Sectors|Sectors |Errors|      |Writes|Count  |Age |
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
|ada0 ?|WD-WMC4N0MAL5D1| 28 |28349|   46|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|  13|
|ada1 ?|WD-WCC7K0KJXKJ3| 28 | 6941|   23|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|  13|
|ada2 ?|WD-WCC4N7FSXRPT| 29 |28349|   46|    0|      0|      0|       0|     5|   N/A|   N/A|    N/A| 276|
|ada3 ?|WD-WCC4N7FSXEU2| 29 |28348|   45|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|  13|
|ada4 ?|WD-WMC4N1069088| 26 |45591|  215|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|  58|
|ada5 ?|WD-WCC7K7LJL4TR| 26 |  170|   13|    0|      0|      0|       0|     0|   N/A|   N/A|    N/A|   7|
+------+---------------+----+-----+-----+-----+-------+-------+--------+------+------+------+-------+----+
########## SMART status report for ada0 drive (Western Digital Red: WD-WMC4N0MAL5D1) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 182 182 021 Pre-fail Always - 5858
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 46
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 28349
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 45
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 14
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 922
194 Temperature_Celsius 0x0022 122 097 000 Old_age Always - 28
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
No Errors Logged
Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Extended offline Completed without error 00% 28037 -
########## SMART status report for ada1 drive (Western Digital Red: WD-WCC7K0KJXKJ3) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 169 169 021 Pre-fail Always - 6541
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 6941
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 23
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 10
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 96
194 Temperature_Celsius 0x0022 122 104 000 Old_age Always - 28
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
No Errors Logged
Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Extended offline Completed without error 00% 6638 -
########## SMART status report for ada2 drive (Western Digital Red: WD-WCC4N7FSXRPT) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 176 175 021 Pre-fail Always - 6183
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 46
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 28349
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 45
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 14
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 1948
194 Temperature_Celsius 0x0022 121 101 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 5
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
No Errors Logged
Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Extended offline Completed without error 00% 21734 -
########## SMART status report for ada3 drive (Western Digital Red: WD-WCC4N7FSXEU2) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 35
3 Spin_Up_Time 0x0027 176 176 021 Pre-fail Always - 6183
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 45
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 062 062 000 Old_age Always - 28348
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 44
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 1078
194 Temperature_Celsius 0x0022 121 095 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
No Errors Logged
Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Extended offline Completed without error 00% 28036 -
########## SMART status report for ada4 drive (Western Digital Red: WD-WMC4N1069088) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 211 175 021 Pre-fail Always - 4450
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 215
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 038 038 000 Old_age Always - 45591
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 206
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 40
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 789
194 Temperature_Celsius 0x0022 124 090 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
ATA Error Count: 46 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 46 occurred at disk power-on lifetime: 45100 hours (1879 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 01 47 43 08 08:36:42.681 READ DMA
c8 00 08 50 01 47 43 08 08:36:35.682 READ DMA
c8 00 08 50 01 47 43 08 08:36:28.683 READ DMA
c8 00 08 50 01 47 43 08 08:36:21.684 READ DMA
c8 00 08 50 01 47 43 08 08:36:14.685 READ DMA
Error 45 occurred at disk power-on lifetime: 45100 hours (1879 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 01 47 43 08 08:36:35.682 READ DMA
c8 00 08 50 01 47 43 08 08:36:28.683 READ DMA
c8 00 08 50 01 47 43 08 08:36:21.684 READ DMA
c8 00 08 50 01 47 43 08 08:36:14.685 READ DMA
Error 44 occurred at disk power-on lifetime: 45100 hours (1879 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 01 47 43 08 08:36:28.683 READ DMA
c8 00 08 50 01 47 43 08 08:36:21.684 READ DMA
c8 00 08 50 01 47 43 08 08:36:14.685 READ DMA
Error 43 occurred at disk power-on lifetime: 45100 hours (1879 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 01 47 43 08 08:36:21.684 READ DMA
c8 00 08 50 01 47 43 08 08:36:14.685 READ DMA
Error 42 occurred at disk power-on lifetime: 45100 hours (1879 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 01 47 43 08 08:36:14.685 READ DMA
Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Extended offline Interrupted (host reset) 90% 44198 -
########## SMART status report for ada5 drive (Western Digital Red: WD-WCC7K7LJL4TR) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 186 181 021 Pre-fail Always - 5658
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 13
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 199 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 170
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 4
194 Temperature_Celsius 0x0022 124 107 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
No Errors Logged
That should include everything, but I have the raw output attached.
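One detail in the ada4 error log is easy to miss in the wall of text: the five logged errors all report the same LBA. A small Python sketch that checks this, assuming the "Error: UNC ... at LBA = 0x... = ..." line format shown in the smartctl output (the `LOG_LINES` list and the `distinct_lbas` helper are made up for illustration):

```python
import re

# The five "Error: UNC" lines from the ada4 report (errors 46 down to 42).
LOG_LINES = [
    "40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040",
    "40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040",
    "40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040",
    "40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040",
    "40 51 08 50 01 47 43 Error: UNC 8 sectors at LBA = 0x03470150 = 54985040",
]

UNC = re.compile(r"Error: UNC (\d+) sectors at LBA = (0x[0-9a-fA-F]+) = (\d+)")


def distinct_lbas(lines):
    """Collect the set of LBAs named in UNC error lines."""
    lbas = set()
    for line in lines:
        match = UNC.search(line)
        if not match:
            continue
        _sectors, hex_lba, dec_lba = match.groups()
        # Sanity check: the hex and decimal forms should agree.
        if int(hex_lba, 16) != int(dec_lba):
            raise ValueError(f"LBA mismatch in line: {line!r}")
        lbas.add(int(dec_lba))
    return lbas


print(distinct_lbas(LOG_LINES))
```

The log only keeps the five most recent of the 46 errors, so this says nothing about the other 41.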
There's a lot going on here. The biggest problems I see are that ada4 has logged 46 ATA errors (the five most recent are all uncorrectable reads at the same LBA, 54985040, at 45100 power-on hours), ada2 shows 5 UDMA CRC errors, and the load cycle counts for some of the drives seem to be relatively high.
I'm not sure how many of my problems were caused by my cable situation earlier. I was getting a pretty consistent error on one of the drives before we moved and I just hadn't had time to look into it, but looking back, it may have been cable related.
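To put the load cycle counts in context, here is a rough Python sketch that divides each drive's Load_Cycle_Count by its Power_On_Hours, using the raw values from the reports above. The `DRIVES` dict and the function name are just for illustration, and whether a given rate is a concern depends on the drive's datasheet rating, which is not part of this output.

```python
# (power_on_hours, load_cycle_count) copied from the SMART reports above.
DRIVES = {
    "ada0": (28349, 922),
    "ada1": (6941, 96),
    "ada2": (28349, 1948),
    "ada3": (28348, 1078),
    "ada4": (45591, 789),
    "ada5": (170, 4),
}


def load_cycles_per_hour(drives):
    """Return {drive: load cycles per power-on hour}, skipping drives with 0 hours."""
    return {
        name: cycles / hours
        for name, (hours, cycles) in drives.items()
        if hours > 0
    }


rates = load_cycles_per_hour(DRIVES)
# Print drives from the highest rate to the lowest.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.4f} load cycles per power-on hour")
```

By this measure ada2 is the highest, at about 0.07 load cycles per power-on hour.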
Like I mentioned above, my primary goal is to get everything healthy enough to work until I can back it all up to an external drive or something, so that I can start over from scratch with a RAIDZ2 setup (I realize this could have been avoided had I been doing backups all along).
Any thoughts/suggestions on where I should go from here? Moving everything to a bigger case will definitely happen sooner rather than later. Should I worry about doing a zpool clear? Are things good enough for me to start backing up now and just worry about everything else after I start over?
Is there anything else I should check? Do y'all need any more info?
Thanks!