Advice needed for pool with multiple hd failures

asmodeus

Explorer
Joined
Jul 26, 2016
Messages
70
Hello,

I need some advice with a pool where multiple hard disks are showing signs of failure.
The first disk began to show signs of failure a few days ago: its SMART attribute 198 (Offline Uncorrectable) count started increasing after a failed SMART test.
I have offlined this disk and already installed its replacement, which is now completing its final SMART test after badblocks.

Now, while badblocks was running on the replacement, two more disks (out of a total of eight in a RAIDZ2) have started to exhibit the same problem: an increasing 198 Offline Uncorrectable count. The pool is currently in a degraded state, but still healthy. I can hardly believe my bad luck (3 out of 8 disks starting to fail within 3 days).

What is the best course of action now? I'm thinking of starting the resilver onto the replacement disk and crossing my fingers that the other two don't give up on me.
Or would it be more advisable to just tear down the pool, replace all three disks and restore from backup? Other thoughts?

Thanks,
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Is it possible that your disk controller or your power supply is failing, rather than several drives failing at the same time?
 

asmodeus

Explorer
Joined
Jul 26, 2016
Messages
70
That's certainly a possibility; I did notice a high-pitched wheezing noise the other day.
How can I be sure? And would that really cause Offline Uncorrectable errors on multiple disks?

I'm using a Supermicro X11SSL-CF with the onboard SAS controller. The whole system is less than three years old.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
The reason I suggested those as possible causes is that a few years back when I reported here that two of the drives in a RaidZ2 pool had suddenly started showing errors, people suggested that I check the power supply and the data connections. As the PSU I was using was not a renowned brand, and possibly the same could be said for the cables, I replaced both, and the problem did not return.

Do you have another machine in which you can install the problem drives (after shutting down your FreeNAS machine, of course) and run S.M.A.R.T. tests?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Uncorrectable sectors are a disk problem, though. If you were seeing UDMA CRC errors and the like, you'd suspect the power supply or the cables.
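For anyone who wants to check this distinction on their own drives, here is a minimal sketch in Python. It assumes the usual column layout printed by `smartctl -A` and the common attribute IDs (5, 197 and 198 for media problems, 199 for UDMA CRC errors). The sample text and its values are made up for illustration, and the function names are hypothetical. Vendors do not all report attributes the same way, so treat the result as a hint rather than a diagnosis.

```python
# Illustrative sample in the column layout of `smartctl -A`.
# The values are invented for this example and are not from this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

MEDIA_ATTRS = {5, 197, 198}  # reallocated, pending, offline uncorrectable
LINK_ATTRS = {199}           # UDMA CRC errors (data path, not the platters)


def parse_raw_values(text):
    """Return {attribute_id: raw_value} for every parsable attribute row."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw[int(fields[0])] = int(fields[9])
        except ValueError:
            continue  # skip raw values that are not plain integers
    return raw


def classify(raw):
    """Rough hint: do the non-zero counters point at the disk or the data path?"""
    media = any(raw.get(attr, 0) > 0 for attr in MEDIA_ATTRS)
    link = any(raw.get(attr, 0) > 0 for attr in LINK_ATTRS)
    if media and link:
        return "both"
    if media:
        return "disk media"
    if link:
        return "cabling or power"
    return "no errors seen"


print(classify(parse_raw_values(SAMPLE)))
```

To use it on a real drive, you would feed the function the text output of `smartctl -A` for that device instead of `SAMPLE`.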

I would suggest resilvering as fast as possible. Also I wouldn’t offline.

If a disk has failing sectors, the rest of its data is still good. For a failure to affect RAIDZ2, the bad sectors have to be co-located in the same (logical) location on 3 disks.

By offlining the first disk you essentially increased the number of failing sectors from a few hundred to 100% of the disk, reducing your RAIDZ2 redundancy immensely.
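To put rough numbers on this argument, here is a minimal Python sketch. It is a deliberately simplified model, not how ZFS actually lays out data: it assumes bad sectors are spread uniformly and independently across each failing disk, that every stripe touches all eight disks, and that each failing disk has an illustrative bad-sector share of `P = 1e-6` (a made-up figure). The function name is hypothetical.

```python
from itertools import combinations
from math import prod


def stripe_loss_probability(bad_fraction, parity=2):
    """Chance that more than `parity` disks are unreadable at one stripe location.

    bad_fraction[i] is the assumed share of disk i that cannot be read
    (0.0 = healthy, 1.0 = offline or missing).
    """
    disks = range(len(bad_fraction))
    total = 0.0
    # Add up every combination in which too many disks are unreadable at once.
    for k in range(parity + 1, len(bad_fraction) + 1):
        for bad in combinations(disks, k):
            total += prod(
                bad_fraction[i] if i in bad else 1.0 - bad_fraction[i]
                for i in disks
            )
    return total


P = 1e-6  # assumed share of bad sectors on each failing disk (illustrative)

# Three failing disks all kept online in an eight-disk RAIDZ2.
all_online = stripe_loss_probability([P, P, P, 0, 0, 0, 0, 0])
# Same pool, but the first failing disk has been offlined (100% unreadable).
one_offlined = stripe_loss_probability([1.0, P, P, 0, 0, 0, 0, 0])

print(f"all three online: {all_online:.1e}")
print(f"one offlined:     {one_offlined:.1e}")
```

Under these assumptions the per-location risk scales with P cubed when all three disks stay online, and with P squared once one of them is offlined, which is the gap in redundancy described above.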
 

asmodeus

Explorer
Joined
Jul 26, 2016
Messages
70
Thank you both for your replies!

I believe Stux is correct in that re-running a SMART test on another machine would not fix Uncorrectable errors. There were no CRC errors in the SMART output, so it really looks like failing disks. I am replacing all three of them under warranty.

The replacement disk is now resilvering; the other two with Uncorrectable sectors are still part of the pool. You mean no further offlining of disks before a resilver, right?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
You mean no further offlining of disks before a resilver right?
If you have a spare SATA port, such that you could connect the replacement disk without removing the disk it's replacing, don't offline it at all. Just do the replacement, and the old disk will go offline automatically when the resilver completes. This preserves your redundancy, which is probably a good idea when you have a few questionable disks.
 

asmodeus

Explorer
Joined
Jul 26, 2016
Messages
70
Great idea on preserving redundancy while the disks are being replaced! Can a disk be moved to another SAS/SATA port after it has been introduced to the pool as a replacement? Currently all eight SAS ports are used for the pool, and there are 6 more SATA ports available. I (eventually) want them all on the SAS ports, not a mix of the two...
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

asmodeus

Explorer
Joined
Jul 26, 2016
Messages
70
Awesome, thanks for confirming! Guess I'll be trying that soon :) Is there any manual interaction required? Oh and does it work with encrypted pools?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Oh and does it work with encrypted pools?
I'd expect so, but be sure to carefully follow the rekeying requirements. I don't use pool encryption, so I can't say more than that.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
This is one of the reasons I leave an empty hot swap tray in a 5-1/4 bay in my case, connected to the motherboard SATA ports. I want to be able to insert a replacement, and resilver/replace without spinning down the other drives. ZFS will generally work out the pool geometry after moving the drive to its permanent location. But I can't say I've tried it with an encrypted pool.
 

tfran1990

Patron
Joined
Oct 18, 2017
Messages
294
Just out of curiosity, what disks and how many spin hours? I'm trying to get an idea of when I should replace a couple of drives in my pool. I have 5 in RAIDZ2. They were all put in the pool at the same time, so there is no staggered spin time on them; they are all at about 20k hours.
 

asmodeus

Explorer
Joined
Jul 26, 2016
Messages
70
I was running eight 4TB Seagate ST4000VN000 drives. They currently report 24346h of Power_On_Hours. I would not replace disks in a pool proactively; if you have SMART tests scheduled, they should (hopefully) provide advance warning. I also invested in a second box to send snapshots to, literally a month before the first disk started failing. Consider this an advert for backups ;-)
 