Multiple drives all failing

Status
Not open for further replies.

Robert Smith

Patron
Joined
May 4, 2014
Messages
270
nas2.jpg

What are those SATA cables you are using? I like how they look.
I also hope they are not the source of the problem, LOL...

I tried Rosewill RCAB-11044 cables, but they have rounded parts on the connectors that prevent plugging two cables in side by side on the motherboard... I am currently using Vantacor cables, but they are nothing more than regular flat cables with a nice sheath over them, so they do not flex much.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Looks like simply regular cables that are striped so you can see the individual data lines.
 

Robert Smith

Patron
Joined
May 4, 2014
Messages
270
You mean, somebody took a knife to regular flat cables and took the outer covering off? That sounds like a safe thing to do, nothing could go wrong. LOL
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I'm saying it came pre-manufactured like that. It's nothing hard, everyone does this with USB cables to come up with their own damn overpriced "proprietary" connector (I'm looking at you, Apple).
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I wanted short cables that would not restrict air flow. I ordered these from Newegg. http://www.newegg.com/Product/Product.aspx?Item=N82E16812123165

My next FreeNAS box will have to be a little more robust. Should I be using server grade everything? (Mobo, psu, ecc memory, processors, case etc) or should I just not expect consumer grade hardware to last more than 2 years?
Consumer-grade hardware will easily last more than 2 years. The problem is more about mission-critical reliability features that typically do not exist in consumer-grade stuff, like ECC, TLER, etc.
Additionally, manufacturers typically do not guarantee the reliability of their non-server-grade stuff when run in a 24/7 configuration.
 

pcmofo

Explorer
Joined
Mar 2, 2012
Messages
98
I found no errors in memtest. I tried booting into Linux, but I apparently forgot how to access the drives from the command line, so I just booted back into FreeNAS. My NAS is now NOT in degraded status but has checksum errors on two drives.

[root@tank] ~# zpool status
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Aug 10 17:18:19 2014
        2.72T scanned out of 6.00T at 334M/s, 2h51m to go
        12K repaired, 45.24% done

config:

        NAME                                            STATE     READ WRITE CKSUM
        data                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            ada0                                        ONLINE       0     0     0
            ada1                                        ONLINE       0     0    49  (repairing)
            ada2                                        ONLINE       0     0     0
            ada3                                        ONLINE       0     0     0
            ada4                                        ONLINE       0     0    34
            gptid/14df0631-209a-11e4-97b8-5404a6da6622  ONLINE       0     0     0
            ada6                                        ONLINE       0     0     0
            ada7                                        ONLINE       0     0     0

errors: No known data errors
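
For anyone who wants to pull the non-zero counters out of output like the above automatically, here is a minimal Python sketch. It is only an illustration, not a FreeNAS tool: the function name `devices_with_errors` and the `SAMPLE` text (a shortened copy of the config table above) are assumptions made for the example, and it deliberately skips counters that zpool abbreviates (such as `1.2K`).

```python
def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for config rows with a non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The device table starts right after the NAME/STATE header row.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        # The "errors:" summary line marks the end of the table.
        if fields[0] == "errors:":
            in_config = False
        if not in_config or len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        # Skip rows whose counters are not plain integers (e.g. "1.2K").
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


# Shortened copy of the table above, used only to exercise the parser.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 49 (repairing)
ada4 ONLINE 0 0 34
errors: No known data errors
"""

print(devices_with_errors(SAMPLE))
```

Because it splits on whitespace, the sketch should not care whether the pasted output kept its original indentation.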
 

pcmofo

Explorer
Joined
Mar 2, 2012
Messages
98
And now for the really long parts... the smartctl command now works...

I put them in a txt file to make them easier to read...

The first two are the two drives that reported errors, the next one is a random drive in the pool that has yet to report an issue, and the last is the new WD drive.

http://pcmofo.com/raidsmartctrl.txt
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
It looks like you aren't running SMART tests on any of your drives, and temps are way, way too hot.
 

vikingboy

Explorer
Joined
Aug 3, 2014
Messages
71
I had this same issue a few years ago on a hardware RAID server. It turned out my PSU was the cause: somehow it was nuking drives. To prevent any more drama and the risk of losing data, I'd install a new PSU just to rule out that possibility.
I suspect the sensible thing to do, finances allowing, would be to get this array stable and then not subject it to heavy use until you have a backup of the data. Now might be a good time to get a server-grade array built, stable, and tested, and get the data off this one.
Good luck.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, SMART data looks fine, but you've broken so many of our cardinal rules I'm just going to put them out there and let you do what you want.

1. You are using RAID with ZFS. This is a major no-no.
2. As a result of #1, you have no ability to monitor SMART on the disks when in operation. This is another major no-no.
3. As an extension of #2, you are not doing any SMART tests. Some of your disks have never had a SMART test performed, while others have 10k+ hours since the last test.
4. You didn't provide the output of SMART for all of your disks.
5. As you can see from the zpool output ada1 and ada4 have checksum errors. This is typical of a failing disk(s) behind a RAID controller which is then unreported.
6. You did NOT create this pool properly from FreeNAS. FreeNAS does not use whole disks and you should NOT be using whole disks in your pool.

Anyway, my work here is basically done. As I said in my first post in this thread, "To be blunt, the fact that you admit to using RAID makes me wonder what other "thumbrules" you ignored when setting up your server" and you've definitely made multiple major ones. I don't know what else to say except that you need to stop and go back to the drawing board with your setup. Clearly you didn't follow many of our best practices, and there's probably still more you are doing wrong that we aren't even aware of yet. In any case good luck with your server.
 

pcmofo

Explorer
Joined
Mar 2, 2012
Messages
98
I wasn't aware that I was breaking any rules. I thought I followed the guides and video tutorials pretty well, but maybe I missed something or there is a miscommunication.

First, here is a full log of all of the smart data from all 8 drives. http://pcmofo.com/nassmart.txt

1. I'm using RAID with ZFS: I guess I don't understand this statement. I am not using hardware RAID controllers. In my previous FreeNAS box I had a RAID5 setup; in this box, instead of RAID5, I set up a RAIDz2, which, from my understanding, is one of the benefits of ZFS. If I am missing something here I would love to know, because I am under the impression that this is correct.
2. I thought I had the ability to monitor SMART on the disks. I see "SMART" enabled on each disk in the GUI. I have received SMART checksum warnings in the past when I had a bad stick of RAM. Why is SMART not running?
6. Again, I have no idea what I did wrong here. I have 8 disks, and as far as I know I created a single pool with RAIDz2. I can pull up any data you would like about my system, but I don't understand what you mean by using whole disks, or how or why I would do something different.

I'd really like to fix my NAS and get on the right track here. If I have to rebuild the entire system, replace hardware, or buy all new hardware, then that's what needs to be done. I have about 6TB of data on the NAS now. I would appreciate some suggestions as to how I can resolve my current issues and how best to reconfigure my current box or build a new one.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I found no errors in memtest. I tried booting in linux but I apparently forgot how to access the drives in the command line so I just booted back into freeNAS. My nas is now NOT in degraded status but has checksum errors on both drives.
Just like to point this out since no one seems to have caught it (the particular line I've bolded).
It was actually a good thing you didn't know how to because you would've done some really terrible things that are irreversible. I wouldn't try to mount that pool in anything other than FreeNAS. Most other OS's can't even read ZFS and trying to do so can be very destructive (would probably end up corrupting your partition tables).
 

pcmofo

Explorer
Joined
Mar 2, 2012
Messages
98
I was trying to read the SMART data from the drives and FreeNAS would not let me. I knew I didn't want to mount the drives or read the data; I was hoping just to be able to read the SMART data.
 

pcmofo

Explorer
Joined
Mar 2, 2012
Messages
98
Well, SMART data looks fine, but you've broken so many of our cardinal rules I'm just going to put them out there and let you do what you want....

I did some more digging. I checked the SMART service on the NAS and found that it is ON with these settings: check interval 30 min, power mode "never" (check regardless of power mode), difference 10, informational 0, critical 0, plus my email address. I assume the SMART error emails I have received in the past were generated here, but I could be wrong; this window seems to indicate it only cares about temps and not any SMART self testing. Why else would these self tests not be taking place regularly?

I have a RAID setup in front of the ZFS, i.e., the devices presented to ZFS are actually RAIDs created by my disk controller in AHCI mode.

It is my understanding that ZFS is meant to use whole disks. http://docs.oracle.com/cd/E19253-01/819-5461/gazdp/index.html

When I set up this box I elected not to use part of each disk for swap, because it is not needed if you have enough RAM, which I do at 1GB per TB: 16GB of RAM for 16TB of raw storage space (8x 2TB).
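
As a rough worked example of the arithmetic behind that rule of thumb, here is a small Python sketch. The function names are made up for illustration, the 1GB-per-TB figure is simply the rule quoted above, and the RAIDZ2 figure is a nominal number that ignores ZFS metadata and padding overhead.

```python
def raw_capacity_tb(disks, disk_tb):
    """Total raw capacity across all disks, in TB."""
    return disks * disk_tb


def raidz_data_capacity_tb(disks, disk_tb, parity):
    """Nominal data capacity of one RAIDZ vdev: parity disks' worth of space is lost."""
    return (disks - parity) * disk_tb


def ram_rule_of_thumb_gb(raw_tb, gb_per_tb=1):
    """RAM suggested by the '1GB of RAM per TB of raw storage' rule of thumb."""
    return raw_tb * gb_per_tb


raw = raw_capacity_tb(disks=8, disk_tb=2)
print("raw TB:", raw)
print("nominal RAIDZ2 data TB:", raidz_data_capacity_tb(8, 2, parity=2))
print("rule-of-thumb RAM GB:", ram_rule_of_thumb_gb(raw))
```

For 8x 2TB disks this gives 16TB raw, a nominal 12TB of data space in RAIDZ2, and 16GB of RAM under the rule.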
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I did some more digging. I checked the SMART service on the NAS and found that it is ON with these settings: check interval 30 min, power mode "never" (check regardless of power mode), difference 10, informational 0, critical 0, plus my email address. I assume the SMART error emails I have received in the past were generated here, but I could be wrong; this window seems to indicate it only cares about temps and not any SMART self testing. Why else would these self tests not be taking place regularly?

I have a RAID setup in front of the ZFS, i.e., the devices presented to ZFS are actually RAIDs created by my disk controller in AHCI mode.

It is my understanding that ZFS is meant to use whole disks. http://docs.oracle.com/cd/E19253-01/819-5461/gazdp/index.html

When I set up this box I elected not to use part of each disk for swap, because it is not needed if you have enough RAM, which I do at 1GB per TB: 16GB of RAM for 16TB of raw storage space (8x 2TB).
That line right there would've been avoided if you'd read the stickies more carefully.
ZFS is designed to work with as few layers of abstraction as possible between the OS and the disks.
The SMART errors could very well be due to FreeNAS not being able to access your disks because I'm guessing that you've created RAID 0's out of those disks?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
this window seems to indicate it only cares about temps and not any SMART self testing. Why else would these self tests not be taking place regularly?
The SMART service periodically monitors the SMART data from your drives (assuming it's available) and warns if things like offline sectors or temperatures go out of spec. It does not schedule self-tests; those are scheduled elsewhere in the GUI. Please, RTFM--this is discussed in some detail there (see section 8.10).
 

pcmofo

Explorer
Joined
Mar 2, 2012
Messages
98
Thanks for the help. I'm not sure how I missed this when setting up the box. I set up a daily short SMART test for all of the drives and a weekly long test. This should help in the future. Is that frequent enough?

I have two drives that reported checksum errors, what's my next step?
Run a scrub again and see if I can fix them?
Replace those drives right away?

From the SMART test that I ran earlier, do I have any other drives that might be on their way out? I'd like to get my server back into working condition and resume my data transfers soon. Eventually I will build a new server (with ECC and other server-grade hardware) and use this server as the backup.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'm afraid I still don't understand how your drives are set up, though I'm not 100% sure it's related to your immediate issues. For best reliability and performance, your drives should be connected to SATA ports on your motherboard or to an HBA. No RAID controller should be used. Any RAID options in your BIOS should be disabled. The only place where any sort of RAID anything should be happening is within FreeNAS. I'm not quite sure if this is how your drives are set up or not, but it should be.

I'm also not sure how you set up your pool in the first place, because it doesn't look like a normal FreeNAS pool. Ordinarily, FreeNAS pools just use gptid to identify member disks, while all but one of yours is identified by the device name (ada0, ada1, etc.). Ordinarily, FreeNAS also partitions the disk, reserving 2 GB (by default) for swap and using the rest for data. Your pool doesn't do this either. Neither of these issues is likely to be causing your immediate problems, but they point to your not having done things in the recommended/supported way. The more things you do this way, the harder it is for anyone here to help out.
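
To make the gptid point concrete, here is a tiny Python sketch that flags pool members identified by a raw device name rather than a `gptid/` label. The function name is hypothetical, and the member names are simply the ones from the `zpool status` output posted earlier in the thread.

```python
def non_gptid_members(members):
    """Return the member names that are not identified by a gptid label."""
    return [name for name in members if not name.startswith("gptid/")]


# Member disks as listed in the zpool status output earlier in the thread.
members = [
    "ada0", "ada1", "ada2", "ada3", "ada4",
    "gptid/14df0631-209a-11e4-97b8-5404a6da6622",
    "ada6", "ada7",
]

print(non_gptid_members(members))
```

On a pool built through the GUI in the usual way, I would expect that list to come back empty.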

Your SMART test schedule sounds reasonable, though I'd probably kick off a long test for all of your disks immediately. Once it's complete, pull the SMART data for all your disks again. Right now, none of them are showing indications of failing (I normally look at IDs 197 and 198, and these are both 0 on all of your disks), but a long SMART test should test the entire disk surface and might show up some trouble. If none of the SMART tests fail, and no offline sectors appear following them, try another scrub and see what it says. I don't remember if you've already tried replacing the cables for the affected drives.

I note that all your drives have seen higher temps than recommended (recommended is <= 40 deg C), though they're all currently within the recommended range.
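
If it helps, here is a minimal Python sketch of that check against the attribute table printed by `smartctl -A`. Treat it as an illustration under assumptions: the function name and the `SAMPLE` rows are invented for the example (they are not from any disk in this thread), the raw value is taken to be the tenth whitespace-separated column, and temperature is assumed to be reported as attribute 194, which is not true of every drive.

```python
WATCHED_IDS = (197, 198)   # Current_Pending_Sector, Offline_Uncorrectable
TEMP_ID = 194              # Temperature_Celsius on many, but not all, drives
MAX_TEMP_C = 40            # recommended ceiling mentioned above


def check_smart_attributes(attr_text):
    """Return a list of warning strings for one drive's `smartctl -A` table."""
    warnings = []
    for line in attr_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if not raw.isdigit():
            continue
        raw = int(raw)
        if attr_id in WATCHED_IDS and raw > 0:
            warnings.append(f"{name} raw value is {raw}")
        elif attr_id == TEMP_ID and raw > MAX_TEMP_C:
            warnings.append(f"temperature {raw} C exceeds {MAX_TEMP_C} C")
    return warnings


# Invented rows in the usual smartctl -A layout, for demonstration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
194 Temperature_Celsius     0x0022   105   095   000    Old_age   Always       -       45
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

for warning in check_smart_attributes(SAMPLE):
    print(warning)
```

A drive that passes this check can still be failing; it only mirrors the two IDs and the temperature limit mentioned above.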
 

pcmofo

Explorer
Joined
Mar 2, 2012
Messages
98
Thanks for those details. I can confirm that I am connecting the drives directly to the 8 SATA ports on the motherboard. I am also using them in AHCI mode with no RAID or other hardware in between. I am looking for the scripts I used to set up the RAID and will post them as soon as I can.

I also started a long test on all the drives per your suggestion. They should be finished in about 5 hours. I ordered two Red drives and will use those to replace anything as needed, or order more. As soon as the tests are finished I will post the SMART data, then clear the zpool status and start a new scrub.
 