Hello,
I apologise in advance for the lengthy post.
Over the past weekend I ran into some trouble with my FreeNAS server. I'm not really an expert, just a home user with slightly above-average knowledge of computers and Linux/Unix. I've been searching the web all day, trying to get a grasp of what's going on in my server, and before spending days more on that, I was hoping a post here could save me some time via an insightful suggestion from a more knowledgeable member.
I use an ASRock C2550D4I board with 16GB of ECC RAM and six WD Red 4TB disks in RAID-Z2. Over the weekend, my server started sending me emails reporting SMART errors on the disks (all of them on the same Marvell controller), specifically "Read SMART Error Log Failed" and "Read SMART Self-Test Log Failed". I checked the drives with smartctl, but nothing turned up and everything looked perfectly fine, so I ignored it for a bit.
During the night, however, more serious emails arrived, first saying "The volume nirvana (ZFS) state is DEGRADED", followed by "The volume nirvana (ZFS) state is UNAVAIL". I then found that one of the controllers on the motherboard had a firmware update available, so I installed it along with a BIOS update that was still pending. After rebooting, the volume showed up again, so things were looking up.
However, it didn't take long for the volume to become degraded again. A zpool status revealed:
Code:
NAME                                            STATE     READ WRITE CKSUM
nirvana                                         DEGRADED     0     0     0
  raidz2-0                                      DEGRADED     0     0     0
    gptid/c26093b1-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
    gptid/c34d8533-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
    gptid/c437e8d4-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
    gptid/c5272d0f-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
    gptid/c6184f5b-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
    gptid/c702c33f-f91e-11e3-82f1-d05099264e54  DEGRADED     0     0   273  too many errors  (repairing)
So that looks like a dying drive, right? SMART, however, still claims everything is perfectly fine. At the same time, dmesg and my server's console output are going crazy with the following messages (continuously):
Code:
(ada5:ahcich5:0:0:0): READ_DMA48. ACB: 25 00 38 dd fe 40 a6 01 00 00 00 01
(ada5:ahcich5:0:0:0): CAM status: Command timeout
(ada5:ahcich5:0:0:0): Retrying command
(ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada5:ahcich5:0:0:0): CAM status: Command timeout
(ada5:ahcich5:0:0:0): Error 5, Retries exhausted
(ada5:ahcich5:0:0:0): READ_DMA48. ACB: 25 00 38 dd fe 40 a6 01 00 00 00 01
(ada5:ahcich5:0:0:0): CAM status: ATA Status Error
(ada5:ahcich5:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(ada5:ahcich5:0:0:0): RES: 51 04 00 00 00 00 00 00 00 00 00
(ada5:ahcich5:0:0:0): Error 5, Retries exhausted
(ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 38 e0 fe 40 a6 01 00 01 00 00
(ada5:ahcich5:0:0:0): CAM status: ATA Status Error
(ada5:ahcich5:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada5:ahcich5:0:0:0): RES: 41 40 cf e0 fe 40 a6 01 00 00 00
(ada5:ahcich5:0:0:0): Retrying command
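Since the console scrolls too fast to read, a rough tally of these messages per device may be easier to reason about. The following is only a minimal sketch: the function name `count_cam_statuses` and the sample text are made up for illustration, and it assumes one message per line in the format shown above. Whether a high share of timeouts versus ATA status errors points at the controller or at the disk is a guess on my part, not something the script can decide.

```python
import re
from collections import Counter

# Matches lines such as "(ada5:ahcich5:0:0:0): CAM status: Command timeout".
# The pattern is based only on the log format seen in this post.
CAM_STATUS = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): CAM status: (.+)")


def count_cam_statuses(log_text):
    """Count 'CAM status' messages per (device, status) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_STATUS.search(line)
        if match:
            device, status = match.group(1), match.group(2).strip()
            counts[(device, status)] += 1
    return counts


# A few lines taken from the console output above, used as sample input.
SAMPLE_LOG = """\
(ada5:ahcich5:0:0:0): CAM status: Command timeout
(ada5:ahcich5:0:0:0): Retrying command
(ada5:ahcich5:0:0:0): CAM status: ATA Status Error
(ada5:ahcich5:0:0:0): CAM status: Command timeout
"""

if __name__ == "__main__":
    for (device, status), n in sorted(count_cam_statuses(SAMPLE_LOG).items()):
        print(f"{device}: {status} x{n}")
```

The real input would be the saved output of dmesg rather than the hard-coded sample.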
The scrub currently in progress reports that it needs over 12 more hours to complete, even though it is at 99.07%, and that estimate keeps going up (probably due to the timeouts above?).
What it boils down to is that I am wondering which actions I should take:
- Accept the drive is broken and file an RMA (though I wonder if I have any chance of success, with no SMART errors showing).
- Blame the controller, since before the firmware upgrade it also complained about the SMART error logs of ada2, ada3 and ada4, and it now seems to be timing out constantly on ada5.
- Wait for the scrub to complete, however much longer that takes, then run a long SMART test on the drive and check those results first.
- Disable AHCI in the BIOS and try with that? (someone suggested this, though I have no real idea if it could make a difference as I don't know much about AHCI)
- Should I even continue the scrub? It has made 0.02% progress in 2 hours and almost seems to be stuck.
- Something else?
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       17749
  3 Spin_Up_Time            0x0027   188   178   021    Pre-fail  Always       -       7566
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10294
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       212
194 Temperature_Celsius     0x0022   109   096   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
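For anyone who wants to check a table like this without reading every row, here is a minimal sketch that pulls out the raw values and flags a few sector-related attributes when they are non-zero. The watchlist and the function names are my own assumptions for illustration, not an official failure threshold, and the parser only assumes the ten-column row layout shown above.

```python
# Attributes whose non-zero raw value is often treated as a warning sign.
# This watchlist is an assumption for illustration, not a vendor threshold.
WATCHLIST = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from SMART attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns;
        # the header row starts with "ID#" and is skipped by this check.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, ignore the row
    return attrs


def suspicious(attrs):
    """Return the watchlisted attributes that have a non-zero raw value."""
    return {name: raw for name, raw in attrs.items()
            if name in WATCHLIST and raw > 0}


# Three rows copied from the table above, used as sample input.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(suspicious(parse_smart_attributes(SAMPLE)))
```

On the rows above this singles out Current_Pending_Sector, whose raw value is 5.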
Any suggestions and tips are very welcome!
Best regards,
boerbiet