Replacing multiple "failing" drives

Status
Not open for further replies.

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Hello fellow FreeNazzers,

I have a query regarding the replacement of multiple hard drives on my FreeNAS box. I'm familiar with the process, having replaced a single drive in the past, but I am now getting errors on two drives, which is slightly worrying!

Configuration is 10 X 2TB, RAIDZ2. Errors I am seeing are...

CRITICAL..... Device /dev/da8 12 Currently unreadable (pending) sectors
CRITICAL..... Device /dev/da6 1 Currently unreadable (pending) sectors
CRITICAL..... Device /dev/da6 1 offline uncorrectable sectors.

I need to get these replaced. However, is it safe, or even possible, to replace two at the same time? Has anyone else been through a similar issue/process?
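For reference, those alerts come from SMART attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable). A quick way to confirm them from the FreeNAS shell, assuming the same device names as in the alerts, is a sketch like this:

```shell
# Read the raw SMART attributes behind the FreeNAS alerts.
# Device names (da6, da8) are taken from the alert text above.
smartctl -A /dev/da8 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'
smartctl -A /dev/da6 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'

# A long self-test will often confirm whether the sectors are really bad;
# check the result later with 'smartctl -l selftest /dev/da8'.
smartctl -t long /dev/da8
```

These commands only read or test the drive; nothing here modifies the pool.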

Also, I won't be sending these drives back to the manufacturer, as the server has been running without issues for over 6 years now, so I consider myself quite lucky!

Many thanks

L
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have been there before and done that.
It is possible to replace them both at once, but I had a full backup server with a live copy of the data if something catastrophic happened. The recommendation would be to replace them one at a time.


 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Ideally you have a free disk slot. That would allow you to replace in place, meaning you could replace a disk without losing any redundancy you still have. Internally, ZFS mirrors the failing disk with the replacement. Any data blocks that can't be obtained from the failing disk are retrieved from the rest of the RAID-Z2. Once the resilver is complete, the failing disk is removed from the pool, and the replacement disk is now in its place.

This is a feature that is somewhat unique to ZFS. I've wanted it since 2000, when I got bitten by two disks with bad blocks (not the same blocks) in a RAID-5. So I was screwed: full backup, rebuild the RAID-5, and restore.

But if you don't have any free slots, then you have to perform a simpler disk replacement and hope you don't have a third disk failure.
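The in-place replacement Arwen describes maps onto ZFS's `zpool replace` with both old and new disks attached. A rough sketch from the command line; the pool name `tank` and the disk labels are placeholders, and on FreeNAS the GUI's "Replace" action does the same thing:

```shell
# In-place replacement: the failing disk stays online while ZFS
# mirrors it onto the new one, so RAID-Z2 redundancy is never reduced.
zpool status tank                        # identify the failing disk's label

# <old-label> is the failing member as shown by zpool status;
# <new-device> is the freshly attached, unused disk.
zpool replace tank <old-label> <new-device>

zpool status tank                        # watch the resilver progress
```

When the resilver finishes, ZFS detaches the old disk automatically; only then is it safe to pull it.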
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have tested the replacement of a drive in RAIDZ2 and found that a replacement disk is resilvered much more quickly if the defective disk is first removed from the pool. If you leave the defective disk in the pool, it slows down the resilver and can significantly delay its completion.
As da6 has just a single sector error, I would replace da8 first. When that completes, you can replace da6. Since these are only 2TB drives, you should be able to complete the resilver in under 6 hours per drive.
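The 6-hour figure is consistent with a simple bandwidth estimate: a resilver is ultimately bounded by how fast the replacement disk can write. A back-of-the-envelope check, assuming a sustained write rate of about 100 MB/s for a 2TB spinner (the rate is an assumption, not a measurement):

```shell
# Worst-case full resilver time = capacity / sustained write speed.
capacity_gb=2000          # 2 TB drive
write_mb_s=100            # assumed sustained sequential write rate
seconds=$(( capacity_gb * 1000 / write_mb_s ))
hours=$(( seconds / 3600 ))
echo "worst-case full resilver: ~${hours} hours"
```

In practice ZFS only resilvers allocated blocks, so a partly full pool finishes sooner than this worst case.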
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Thanks for the replies.

Arwen, great idea. All my disks are connected via two IBM M1015 controllers. None of the SATA connectors on the motherboard are being used, so in theory I could:
  1. Buy two new disks
  2. Connect them directly to the motherboard
  3. Remove the old disks via the GUI
  4. Add the new ones in and let the resilver complete
  5. Then physically replace the failed disks with the new ones
Sound about right?

All my data is backed up (via snapshots) to another server running ZFS on Linux, so my data is safe.
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
Good tip. I will bear that in mind. Many thanks
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I don't think that is what Arwen was saying. There are basically two methods of replacing a faulty disk; I call them 'in place' and 'remove and replace'.

In an 'in place' replacement, you have both the new drive and the old drive online. In the GUI, there is an option for telling FreeNAS to replace an online drive with another drive that is in the system but not part of a pool. There is no need to offline or remove the defective drive, because when you tell FreeNAS to replace it, FreeNAS first attempts to copy data from the defective drive to the replacement drive. When the resilver is complete, FreeNAS offlines the drive that was replaced, and only then do you physically remove it from the system.

The method I prefer is to remove the defective drive completely (offline it in the GUI) and pull it out like a bad tooth. Then put the new drive in and tell the system to replace the missing drive with the new one. From trying both methods and timing the results, I have found that remove-and-replace is faster than the 'in place' method, and my guess is that this is because the system is not attempting to read from the defective drive. CPU utilization will be high during the resilver because the system is recomputing all the checksum data, but as long as you have a strong CPU, that math will not slow down the process. The bottleneck should be the speed at which the replacement drive can write data to disk.
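Chris's remove-and-replace method looks roughly like this on the command line (the FreeNAS GUI wraps the same operations; `tank` and the disk labels are placeholders for your own pool and `zpool status` labels):

```shell
# Remove-and-replace: take the bad disk out of service first, so the
# resilver never tries to read from it and rebuilds purely from the
# remaining RAID-Z2 members.
zpool offline tank <failing-label>       # then physically pull the disk

# ...install the new disk in the freed slot, then:
zpool replace tank <failing-label> <new-device>

zpool status tank                        # resilver reads from the other
                                         # nine disks, writes the new one
```

Note that while the resilver runs, the pool is one disk short, so a RAID-Z2 vdev can only tolerate one further failure until it completes.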
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Fun fact: Allan Jude recently resilvered ten disks at a time (for expansion, so the pool was absolutely healthy, not that it would matter). That was a fun thing to see on IRC.
 