Noobie Experiencing Possible Drive Failure?

Status
Not open for further replies.

Macaroni323

Explorer
Joined
Oct 8, 2015
Messages
60
I've been running FreeNAS for a few years as a home server with no real issues until today, when I received two warnings:

CRITICAL: Dec. 16, 2018, 4:17 a.m. - Device: /dev/ada1, 2 Currently unreadable (pending) sectors
CRITICAL: Dec. 16, 2018, 4:05 a.m. - The volume rixpool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

I am trying to figure out exactly what is happening to the array. From the first warning, I think it is saying the drive ADA1 is failing with 2 sectors on the drive not being readable (though I first read it as both ADA1 and 2 are failing with bad sectors). From the second warning, I think the data is OK since the pool is "ONLINE" (I'm confirming my backup of the array currently). I'm just looking for some confirmation of my interpretation.
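For anyone else puzzling over the same alert: the "2" is a sector count, not a second device. A minimal Python sketch of how the line splits up is below. The regular expression and the function name `parse_pending_alert` are my own illustration and assume the alert wording matches the line quoted above; they are not part of FreeNAS.

```python
import re

# Pattern written against the alert text quoted in this thread (an assumption,
# not an official FreeNAS format specification).
ALERT_RE = re.compile(
    r"Device: (?P<device>/dev/\w+), (?P<count>\d+) "
    r"Currently unreadable \(pending\) sectors"
)


def parse_pending_alert(line):
    """Return (device, pending_sector_count), or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("count"))


alert = (
    "CRITICAL: Dec. 16, 2018, 4:17 a.m. - Device: /dev/ada1, "
    "2 Currently unreadable (pending) sectors"
)
print(parse_pending_alert(alert))  # ('/dev/ada1', 2)
```

So the alert names one device, /dev/ada1, with two pending sectors.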

I have four 4TB Western Digital "Red" hard drives installed, giving a total storage size of 8TB. On the "Reporting" page I notice that ADA0 and ADA1 show similar "Disk I/O" activity, and so do ADA2 and ADA3. From that I assume ADA0 and ADA1 are mirrored as one 4TB pair and ADA2 and ADA3 as another, making up the available 8TB. The "Disk Latency" chart tells a different story. ADA2 and ADA3 have similar latencies, but ADA0 and ADA1 are very different (especially at 4:05am). ADA1 showed a 15 second latency around the time of the critical warning (and a 2.5 second latency about 2 hours later). I think this just confirms that the ADA1 disk is having "issues".
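The 8TB figure is consistent with two mirrored pairs striped together. A rough sketch of that arithmetic follows, assuming the disks are listed so that each consecutive pair forms one mirror (ada0+ada1, ada2+ada3, which is only the guess made above). The function name is hypothetical, and the result is raw capacity that ignores filesystem overhead.

```python
def striped_mirror_capacity_tb(disk_sizes_tb, disks_per_mirror=2):
    """Raw usable capacity of mirrors striped together.

    Assumes consecutive entries in disk_sizes_tb belong to the same mirror.
    """
    if len(disk_sizes_tb) % disks_per_mirror != 0:
        raise ValueError("disk count must be a multiple of the mirror width")
    total = 0
    for start in range(0, len(disk_sizes_tb), disks_per_mirror):
        mirror = disk_sizes_tb[start:start + disks_per_mirror]
        # A mirror holds one copy of the data, limited by its smallest member.
        total += min(mirror)
    return total


print(striped_mirror_capacity_tb([4, 4, 4, 4]))  # 8
```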
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I am trying to figure out exactly what is happening to the array. From the first warning, I think it is saying the drive ADA1 is failing with 2 sectors on the drive not being readable (though I first read it as both ADA1 and 2 are failing with bad sectors). From the second warning, I think the data is OK since the pool is "ONLINE" (I'm confirming my backup of the array currently). I'm just looking for some confirmation of my interpretation.
Yes, the disk /dev/ada1 is failing with 2 currently unreadable (pending) sectors, and you should plan to replace it with one of your burned-in cold spares. You do have cold-spare disks on hand, don't you? Ones you have already burn-in tested, so they are ready to use at any moment?
I have four, 4TB Western Digital "Red" hard drives installed forming a total storage size of 8TB. From the "Reporting" webpage I notice that ADA0 and ADA1 have similar "Disk I/O" activity, as well as ADA2 and ADA3 have similar "Disk I/O" activity. I would assume that ADA0 and 1 are mirrored as 4TB and ADA2 and 3 are mirrored as 4TB of the available 8TB.
Presumably you set the system up; how is it that you don't know what the configuration is?
In the "Disk Latency" chart there is a different story. ADA2 and ADA3 have similar latencies but ADA0 and ADA1 are very different (especially at 4:05am). ADA1 showed a 15 second latency around the time of the indicated critical warning (also a 2.5 second latency around 2 hours later). I think this just confirms ADA1 disk is having "issues".
Yes, failure confirmed. If you replace the disk before the mirror fails, you won't lose the entire pool. You only have one disk standing between you and total data loss, so move with some haste to obtain a suitable replacement disk, burn it in, and install it. Check the serial numbers of the disks to make sure you remove the correct one when you swap them. In the web GUI you can see all the disks listed under Storage > View Disks:
https://www.ixsystems.com/documentation/freenas/11/storage.html#view-disks
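If you prefer the command line, the serial number also appears in the output of `smartctl -i`. The Python sketch below pulls it out of that text. It assumes smartmontools is installed and that the output contains a "Serial Number:" line, as it does for ATA disks. The helper names and the sample model and serial strings are made up for illustration.

```python
import subprocess


def parse_serial(smartctl_info_text):
    """Return the value of the 'Serial Number:' line, or None if it is absent."""
    for line in smartctl_info_text.splitlines():
        if line.startswith("Serial Number:"):
            return line.split(":", 1)[1].strip()
    return None


def serial_for_device(device):
    """Run `smartctl -i` on a device (usually needs root) and return its serial."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_serial(result.stdout)


# Hypothetical sample text; the model and serial are placeholders, not real values.
SAMPLE = """\
Device Model:     EXAMPLE-MODEL
Serial Number:    WD-EXAMPLE0001
"""
print(parse_serial(SAMPLE))  # WD-EXAMPLE0001
```

On the server itself you would call `serial_for_device("/dev/ada1")` and compare the result with the label on the physical drive before pulling it.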
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

Macaroni323

Explorer
Joined
Oct 8, 2015
Messages
60
Wow... Thanks, Chris, for the quick reply... I guess I needed this "scare" to get me to keep cold spares ready. ;-P I have ordered a single drive, which will be here in 3 days, but... I may change my order to 2 new drives so that I have a cold spare. I have a WD external RAID1 array as a backup, so I'm not too worried (right now).

There is a tortured path to the four 4TB drives in my FreeNAS box. I started FreeNAS years ago with two 4TB WD Red drives in the array, backed up to an external WD USB RAID array with two 4TB drives (matching the ones in my FreeNAS box). During expansion (from the earlier post) I moved the two 4TB backup drives from the external USB case into the FreeNAS array (one of those is the drive that is failing, and it's 6 months out of warranty -_- ). I then purchased two 8TB drives for the WD USB RAID case; those are in RAID1 and are my current backup.

I do remember the post, but I don't know how to get that view of the array on the current web page. I've looked under Storage, which only shows the users in the pool, not the physical disks.

Thanks for the quick help.

Rick
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

Macaroni323

Explorer
Joined
Oct 8, 2015
Messages
60
Excellent video... I guess now I remember that you need to select items on the webpage and watch the page change for other options/views. I believe I had this problem when expanding the array about a year ago. I don't interact with the webpage often enough to learn the "loopholes". Your video helped perfectly... So good was the help that I think it shows I have misconfigured my 4 users under 1 volume, whereas in your array I think you have a separate volume for each user of the server (correct me if I'm wrong). I will work on changing this (if this is in fact true) after I get the array hardware back to correct operational status. My next question will be how to properly replace my failing ADA1 disk once my replacement drive is ready.
 

Macaroni323

Explorer
Joined
Oct 8, 2015
Messages
60
Here's what I found. Looks like a checksum error on ADA1.
Screen Shot 2018-12-16 at 10.28.08 AM.png


Expanding "rixpool"
Screen Shot 2018-12-16 at 10.28.31 AM.png
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
So good was the help that I think it shows I have misconfigured my 4 users under 1 volume, when in your array I think you have a separate volume for each user of the server (correct me if I'm wrong). I will work on changing this (if this is in fact true) after I get the array hardware back to correct operational status.
No, those are not users. I have different pools for different storage needs; I just named them after people for personal reasons. In my configuration, the main storage is the Emily pool and all users have access to that. The Irene pool is a full copy of everything in the Emily pool, but nobody can access it except me, from the command line, because it is not shared to the network. The Backup pool is another backup of what is in the Emily pool, but it is shared to the network in read-only mode so that users can fix their own mistakes without needing to bother me over an accidental deletion or similar mishap.
There is nothing wrong with the way your system is configured and I am not advising you to change it, except that you might want a spare disk, especially since the ones you have are getting a little older. I try to replace my disks around the five-year mark, or earlier if they fail on me.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS. I just recently stopped using a 2TB drive that had been in use for around seven years. Still working perfectly. You can have drives that last well beyond the 5 year mark, but the likelihood (statistically) of a failure increases sharply after five years.
 

Macaroni323

Explorer
Joined
Oct 8, 2015
Messages
60
Good... Again, thanks for the help. I may shut down the server for a few days until I have the new drives (I also need to find the ADA1 drive by its serial number). I will report back after replacing the drive.

Rick
 

Macaroni323

Explorer
Joined
Oct 8, 2015
Messages
60
Just a post-maintenance update... Thanks for all the help, Chris. The system is back up and stable. I checked the SMART data on my drives: the bad drive had a "Raw_Read_Error_Rate" of about 350, two of the others had 0, and ADA2 has 18 (probably the next to fail). I will watch these more carefully now. The instructions in the documentation on changing the drive worked perfectly (now that you made me "wise" to the menu selections in the web interface).

Rick
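For watching those SMART values over time, here is a minimal Python sketch that reads the attribute table printed by `smartctl -A` and reports the watched attributes with a non-zero raw value. It assumes the usual ten-column table layout. The watch list is my own choice, not an official one. In the sample table, only the raw values 350 and 2 come from this thread; every other column is a placeholder.

```python
# Attributes to keep an eye on (an illustrative selection, not an official list).
WATCHED = ("Raw_Read_Error_Rate", "Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_smart_attributes(text):
    """Map attribute name to integer raw value from `smartctl -A` style output."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit check.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attributes[fields[1]] = int(fields[9])
    return attributes


def nonzero_watched(attributes):
    """Return only the watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attributes.items()
            if name in WATCHED and raw > 0}


# Sample table: raw values 350 and 2 are from the thread, the rest is placeholder.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       350
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
"""
print(nonzero_watched(parse_smart_attributes(SAMPLE)))
```

Running it against each disk's `smartctl -A` output every so often and saving the results would show whether a value like the 18 on ADA2 is creeping upward.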
 