Replacing multiple "failing" drives

Status
Not open for further replies.

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Hello fellow FreeNazzers,

I have a query regarding the replacement of multiple hard drives on my FreeNAS box. I'm familiar with the process, having replaced a single drive in the past, but I am now getting errors on two drives, which is slightly worrying!

Configuration is 10 X 2TB, RAIDZ2. Errors I am seeing are...

CRITICAL..... Device /dev/da8 12 Currently unreadable (pending) sectors
CRITICAL..... Device /dev/da6 1 Currently unreadable (pending) sectors
CRITICAL..... Device /dev/da6 1 offline uncorrectable sectors.

I need to get these replaced. However, is it safe, or even possible, to replace two at the same time? Has anyone else been through a similar issue/process?
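For reference, those alerts come from SMART attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable). A quick way to confirm them from the FreeNAS shell, assuming the same device names as in the alerts, is a sketch like this:

```shell
# Read the raw SMART attributes behind the FreeNAS alerts.
# Device names (da6, da8) are taken from the alert text above.
smartctl -A /dev/da8 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'
smartctl -A /dev/da6 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'

# A long self-test will often confirm whether the sectors are really bad;
# check the result later with 'smartctl -l selftest /dev/da8'.
smartctl -t long /dev/da8
```

These commands only read or test the drive; nothing here modifies the pool.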

Also, I won't be sending these drives back to the manufacturer, as the server has been running without issues for over 6 years now, so I consider myself quite lucky!

Many thanks

L
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have been there before and done that.
It is possible to replace them both at once, but I had a full backup server with a live copy of the data if something catastrophic happened. The recommendation would be to replace them one at a time.


 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Ideally you have a free disk slot. That would allow you to replace in place, meaning you could replace a disk without losing any redundancy you still have. Internally, ZFS mirrors the failing disk with the replacement. Any data blocks that can't be obtained from the failing disk are retrieved from the rest of the RAID-Z2. Once the resilver is complete, the failing disk is removed from the pool, and the replacement disk is now in its place.

This is a feature that is somewhat unique to ZFS. I've wanted it since 2000, when I got bitten by two disks with bad blocks (not the same blocks) in a RAID-5. So I was screwed: full backup, rebuild the RAID-5, and restore.

But if you don't have any free slots, then you have to perform a simpler disk replacement and hope you don't have a third disk failure.
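The in-place replacement Arwen describes maps onto ZFS's `zpool replace` with both old and new disks attached. A rough sketch from the command line; the pool name `tank` and the disk labels are placeholders, and on FreeNAS the GUI's "Replace" action does the same thing:

```shell
# In-place replacement: the failing disk stays online while ZFS
# mirrors it onto the new one, so RAID-Z2 redundancy is never reduced.
zpool status tank                        # identify the failing disk's label

# <old-label> is the failing member as shown by zpool status;
# <new-device> is the freshly attached, unused disk.
zpool replace tank <old-label> <new-device>

zpool status tank                        # watch the resilver progress
```

When the resilver finishes, ZFS detaches the old disk automatically; only then is it safe to pull it.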
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have tested the replacement of a drive in RAIDZ2 and found that a replacement disk is resilvered much more quickly if the defective disk is first removed from the pool. If you leave the defective disk in the pool, it slows down the resilver and can significantly delay its completion.
As da6 has just a single sector error, I would replace da8 first. When that completes, you can replace da6. Since these are only 2TB drives, you should be able to complete the resilver in under 6 hours per drive.
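The 6-hour figure is consistent with a simple bandwidth estimate: a resilver is ultimately bounded by how fast the replacement disk can write. A back-of-the-envelope check, assuming a sustained write rate of about 100 MB/s for a 2TB spinner (the rate is an assumption, not a measurement):

```shell
# Worst-case full resilver time = capacity / sustained write speed.
capacity_gb=2000          # 2 TB drive
write_mb_s=100            # assumed sustained sequential write rate
seconds=$(( capacity_gb * 1000 / write_mb_s ))
hours=$(( seconds / 3600 ))
echo "worst-case full resilver: ~${hours} hours"
```

In practice ZFS only resilvers allocated blocks, so a partly full pool finishes sooner than this worst case.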
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Thanks for the replies.

Arwen, great idea. All my disks are connected via two IBM M1015 controllers. None of the SATA connectors on the motherboard are being used, so in theory I could:
  1. Buy two new disks
  2. Connect them directly to the motherboard
  3. Remove the old disks via the GUI
  4. Add the new ones in and let the resilver complete
  5. Then physically replace the failed disks with the new ones
Sound about right?

All my data is backed up (via snapshots) to another server running ZFS on Linux, so my data is safe.
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Good tip. I will bear that in mind. Many thanks
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I don't think that is what Arwen was saying. There are basically two methods of replacing a faulty disk; I call them 'in place' and 'remove and replace'.

In an 'in place' replacement, you have both the new drive and the old drive online. In the GUI, there is an option for telling FreeNAS to replace an online drive with another drive that is in the system but not part of a pool. There is no need to offline or remove the defective drive, because when you tell FreeNAS to replace it, FreeNAS first attempts to copy data from the defective drive to the replacement drive. When the resilver is complete, FreeNAS offlines the drive that was replaced, and only then do you physically remove it from the system.

The method I prefer is to remove the defective drive completely (offline it in the GUI) and pull it out like a bad tooth. Then put the new drive in and tell the system to replace the missing drive with the new one. From trying both methods and timing the results, I have found that remove-and-replace is faster than the 'in place' method, and my guess is that this is because the system is not attempting to read from the defective drive. CPU utilization will be high during the resilver because the system is recomputing all the checksum data, but as long as you have a strong CPU, that math will not slow down the process. The bottleneck should be the speed at which the replacement drive can write data to disk.
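Chris's remove-and-replace method looks roughly like this on the command line (the FreeNAS GUI wraps the same operations; `tank` and the disk labels are placeholders for your own pool and `zpool status` labels):

```shell
# Remove-and-replace: take the bad disk out of service first, so the
# resilver never tries to read from it and rebuilds purely from the
# remaining RAID-Z2 members.
zpool offline tank <failing-label>       # then physically pull the disk

# ...install the new disk in the freed slot, then:
zpool replace tank <failing-label> <new-device>

zpool status tank                        # resilver reads from the other
                                         # nine disks, writes the new one
```

Note that while the resilver runs, the pool is one disk short, so a RAID-Z2 vdev can only tolerate one further failure until it completes.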
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Fun fact: Allan Jude recently resilvered ten disks at a time (for expansion, so the pool was absolutely healthy, not that it would matter). That was a fun thing to see on IRC.
 