MorkaiTheWolf
Dabbler
- Joined
- Aug 8, 2018
- Messages
- 32
This evening I received an alert on my system stating that my main zpool (labeled Tank) had become degraded.
The specific messages:
Code:
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], not capable of SMART self-check
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], failed to read SMART Attribute Data
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], Read SMART Self-Test Log Failed
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], Read SMART Error Log Failed
I am unsure whether I should try rebooting the system, but I fear this is a sign that one of my disks is failing. To provide some further background, here are some system notes:
Version: FreeNAS-11.1-U6 (caffd76fa)
Motherboard: ASRock Motherboard ATX DDR3 1066 Intel LGA 2011 EP2C602-4L/D16
CPU: 2x Xeon E5-2680 v2 @ 2.80GHz
CPU Cooler: 2x Noctua i4
RAM: 56 GB (two 4GB sticks went bad that I still need to replace)
PSU: EVGA SuperNOVA 850 T2
Case: Phanteks Enthoo Pro
HBA: LSI 9210-8i
Storage:
6x WD Red HE 10TB
2x Samsung 850 EVO 250GB
Zpool status
Code:
  pool: Jails
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:04:14 with 0 errors on Sat Dec  1 05:04:14 2018
config:

	NAME                                            STATE     READ WRITE CKSUM
	Jails                                           ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    gptid/d7e792f0-aff6-11e8-a085-d05099c3f976  ONLINE       0     0     0
	    gptid/d86c6f5e-aff6-11e8-a085-d05099c3f976  ONLINE       0     0     0

errors: No known data errors

  pool: Tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub repaired 0 in 0 days 13:44:27 with 0 errors on Sun Dec  2 13:44:28 2018
config:

	NAME                                            STATE     READ WRITE CKSUM
	Tank                                            DEGRADED     0     0     0
	  raidz3-0                                      DEGRADED     0     0     0
	    gptid/1f66829a-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
	    gptid/20911f23-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
	    gptid/21b89c69-d719-11e8-a0f2-d05099c3f976  FAULTED      1     1     0  too many errors
	    gptid/22d8de81-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
	    gptid/23f9ec88-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
	    gptid/251ff784-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:54 with 0 errors on Sun Dec 23 03:46:54 2018
config:

	NAME          STATE     READ WRITE CKSUM
	freenas-boot  ONLINE       0     0     0
	  da8p2       ONLINE       0     0     0
glabel status:
Code:
                                      Name  Status  Components
gptid/d7e792f0-aff6-11e8-a085-d05099c3f976     N/A  da0p2
gptid/d86c6f5e-aff6-11e8-a085-d05099c3f976     N/A  da1p2
gptid/1f66829a-d719-11e8-a0f2-d05099c3f976     N/A  da2p2
gptid/20911f23-d719-11e8-a0f2-d05099c3f976     N/A  da3p2
gptid/21b89c69-d719-11e8-a0f2-d05099c3f976     N/A  da4p2
gptid/22d8de81-d719-11e8-a0f2-d05099c3f976     N/A  da5p2
gptid/23f9ec88-d719-11e8-a0f2-d05099c3f976     N/A  da6p2
gptid/251ff784-d719-11e8-a0f2-d05099c3f976     N/A  da7p2
gptid/27bbfa76-afe1-11e8-8edb-d05099c3f976     N/A  da8p1
With this in mind, I figured I would check the drive's SMART report, and that is where the weirdness begins. This is all the output I get:
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/da4 failed: INQUIRY failed
This makes me think the drive might be dead or dying, since I can't even read from it. That said, I do have a periodic SMART status report set up, and here are the results from 7 a.m. this morning:
Code:
########## SMART status report for da4 drive (: 7JGGTLZC) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   130   130   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   149   149   024    Pre-fail  Always       -       439 (Average 442)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1900
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       2422
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       2422
194 Temperature_Celsius     0x0002   250   250   000    Old_age   Always       -       26 (Min/Max 20/36)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

No Errors Logged

Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error        00%              246  -
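As a side note, to keep an eye on the handful of attributes that actually predict failure, I've been thinking of pulling them out of these saved reports with a one-liner, something like this (the filename smart_da4.txt is just an example of where I'd save the report):

```shell
# Print name and raw value of the key failure-predicting SMART attributes
# from a saved smartctl attribute report (field 2 = name, last field = raw value)
awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ { print $2, $NF }' smart_da4.txt
```

If any of those raw values start climbing above zero, that's usually the point where I'd stop trusting the disk regardless of the overall PASSED verdict.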
Still, I feel like I am missing something about what the cause could be. Would it be worth rebooting the system to see if I can run smartctl -a against the drive again, or would that cause more issues?
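Before rebooting, my thought was to poke at the device from the CAM layer first, to see whether da4 has dropped off the bus entirely or is just refusing SMART commands. A rough sketch of what I'd run (these are standard FreeBSD camcontrol(8) and dmesg commands, not something I've verified fixes anything):

```shell
# Is da4 still attached to the CAM subsystem at all?
camcontrol devlist

# Any recent kernel/CAM errors mentioning da4?
dmesg | grep -i da4

# Try a raw SCSI INQUIRY directly, bypassing smartctl
camcontrol inquiry da4
```

If the INQUIRY fails here too, I'd take that as the drive (or its HBA/cable path) being genuinely unreachable rather than a smartctl quirk.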
Luckily, I've documented where the drive sits in the case, so a swap shouldn't be terribly difficult, but it will take some time to find a 10TB replacement at a decent price.
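For when I do get a replacement, my understanding of the command-line replacement procedure would be roughly the following (the old gptid is the faulted one from my zpool status; the new gptid is a placeholder, and I gather the FreeNAS GUI's disk-replace workflow handles the partitioning for you, so treat this as a sketch):

```shell
# Take the faulted disk out of the pool before physically pulling it
zpool offline Tank gptid/21b89c69-d719-11e8-a0f2-d05099c3f976

# ...swap the physical drive, partition the new disk to match...

# Resilver onto the new device (NEW-DISK-GPTID is a placeholder)
zpool replace Tank gptid/21b89c69-d719-11e8-a0f2-d05099c3f976 gptid/NEW-DISK-GPTID
```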