Multiple drives all failing

pcmofo · Aug 13, 2014

Apparently I set up the RAID using the following command from the terminal instead of the GUI.

zpool create tank raidz2 /dev/ada0 /dev/ada1 /dev/ada2 etc...

pcmofo · Aug 13, 2014

Here is my SMART data after doing a long test on all of the drives. http://pcmofo.com/smarttestdata.txt

I started the scrub and the same drives, ada1 and ada4, immediately started showing "repairing" in the status.

danb35 · Aug 13, 2014

Strange. Nothing in the SMART tests jumps out to me as being problematic, which I'd expect to see if there were problems with the drives. If you have a spare drive or two, I'd probably try replacing one of them (probably ada1) to see if it makes a difference, but I'm not too confident in that.

pcmofo · Aug 13, 2014

I did a bit more testing. The scrub is only about 30% done, but already there are more errors than in the previous scrub by at least a factor of 10. ada1 now has 646K and ada4 now has 3.32K in the CKSUM column. I followed the cables from the drives to the controllers. Drives 1 and 4 are on different controllers on the motherboard (SATA2/SATA3), although the previously failed drive, ada5, was on the same controller as ada4.

I also got out the heat gun. The hottest thing in the case is the large heat sink on the mobo (middle of the picture), which is 130F. Everything else is 100F (RAM) or less. I guess I will replace the drives one at a time tomorrow and see what happens next.

pcmofo · Aug 13, 2014

[root@tank] ~# zpool status -v data
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  scan: scrub repaired 28.2G in 6h5m with 0 errors on Wed Aug 13 23:09:26 2014
config:

        NAME                            STATE     READ WRITE CKSUM
        data                            ONLINE       0     0     0
          raidz2-0                      ONLINE       0     0     0
            ada0                        ONLINE       0     0     0
            ada1                        ONLINE       0     0 1.02M
            ada2                        ONLINE       0     0     0
            ada3                        ONLINE       0     0     0
            ada4                        ONLINE       0     0  291K
            gptid/14df0631-209a-11e4-9  ONLINE       0     0     0
            ada6                        ONLINE       0     0     0
            ada7                        ONLINE       0     0     0

errors: No known data errors
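
For anyone who wants to watch these counters from a script rather than by eye, here is a minimal sketch that pulls the devices with non-zero READ/WRITE/CKSUM counters out of `zpool status` text like the output above. The function names `parse_count` and `devices_with_errors` are made up for illustration, and the assumption that the K/M suffixes are 1024-based is mine; the displayed values are rounded anyway, so treat the results as approximate.

```python
# Assumed: zpool status abbreviates counters with 1024-based suffixes.
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a counter such as '0', '291K' or '1.02M' to an approximate integer."""
    text = text.strip()
    if text and text[-1].upper() in _SUFFIXES:
        return round(float(text[:-1]) * _SUFFIXES[text[-1].upper()])
    return int(text)


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for config rows with a non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The column header marks the start of the device table.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        # Skip everything before the table, blank lines, and odd-shaped rows.
        if not in_config or len(fields) != 5:
            continue
        # A footer such as "errors: No known data errors" ends the table.
        if fields[0].endswith(":"):
            break
        name, _state, read, write, cksum = fields
        counts = tuple(parse_count(value) for value in (read, write, cksum))
        if any(counts):
            flagged[name] = counts
    return flagged
```

Fed the status output above, this should flag only ada1 and ada4. Rows that carry a trailing note after the counters (for example a "(repairing)" marker during a scrub) do not have exactly five fields, so this sketch skips them.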

pcmofo · Aug 14, 2014

I did an additional scrub on the data last night after receiving those results. On the second scrub that completed this morning I had zero errors and zero repairs...

I decided to put the original ada5 Seagate drive back in the server, as it is most likely not bad. I took the WD offline and I am resilvering now. When it's done I will try copying a few gigs onto the pool and then run another scrub.

pcmofo · Aug 15, 2014

Alright, here is the current status of my FreeNAS box.

I transferred about 500GB of data onto it with no issues, warnings or errors. I then ran a scrub without any issues or errors. All of the original drives are back in the box and everything is exactly the way it was a week ago, before I had any issues.

As far as I can tell, there are no issues with any of the drives' SMART data.

Since everything about the server checks out, I believe something caused the system to temporarily flip out and think that the drives were going bad. The most likely causes are:
1. A bad part of some kind, motherboard, psu etc.
2. A power source issue (I'm using a surge protector and a large pro APC UPS)
3. A heat issue (All fans are working and temps appear to be fine currently)

While I am relieved my system is currently working fine, I would like to know what caused the issue so I can prevent it from happening again. Anyone have any ideas as to what might have happened to my system? Any other tests I can perform to help figure it out?

TXAG26 · Aug 17, 2014

Your drives appear fine. I'd put it down to either a motherboard controller issue or a SATA cable issue. It goes without saying, but I assume you have a good backup of this RAIDZ2 data? I think the Z68 chipset only has 6 SATA ports. Where did you get the additional 2 ports from, since it looks like you're running 8 drives? I'd be concerned about any 3rd party chips used to add SATA to the motherboard. Also, if you suspect it's a hardware problem, I would recommend you stop reading from and writing to the disks, and that includes scrubs, until you can be 100% certain you've got the issue fixed. A flaky SATA controller or bad stick of RAM can completely trash the array while performing a scrub. Just an FYI...

Might be time to look at upgrading to a board with a real HBA storage controller and ECC RAM. The Intel ICH SATA ports work decently for small arrays, but once you start patchworking Intel SATA2, Intel SATA3, and some 3rd party SATA controller into one array, the chances of random issues showing up go up sharply. Just my $0.02, though.

jgreco · Aug 18, 2014

pcmofo said:
Alright, here is the current status of my FreeNAS box.

I transferred about 500GB of data onto it with no issues, warnings or errors. I then ran a scrub without any issues or errors. All of the original drives are back in the box and everything is exactly the way it was a week ago, before I had any issues.

As far as I can tell, there are no issues with any of the drives' SMART data.

Since everything about the server checks out, I believe something caused the system to temporarily flip out and think that the drives were going bad. The most likely causes are:
1. A bad part of some kind, motherboard, psu etc.
2. A power source issue (I'm using a surge protector and a large pro APC UPS)
3. A heat issue (All fans are working and temps appear to be fine currently)

While I am relieved my system is currently working fine, I would like to know what caused the issue so I can prevent it from happening again. Anyone have any ideas as to what might have happened to my system? Any other tests I can perform to help figure it out?

Desktop grade parts are not really designed for 24/7 operation and it is possible you've used up or worn out some marginal part.

If you didn't find dust bunnies reproducing in your box, and all the fans seem clean and happy, the values from your SMART data seem to indicate that it was perhaps a little warm but not intolerably so.

If and when you start replacing parts, please refer to the hardware forum sticky to help guide you to the most appropriate parts. The Supermicro boards, for example, are designed to be placed in servers that run 24/7. Your typical motherboard manufacturer doesn't go for the most expensive parts possible because the PC market is insanely competitive and they need to sell products people are willing to buy. So on a range of parts with various qualities, they typically pick one with the intention to run maybe 8 hours a day 5 days a week. Supermicro probably pays a little bit more for parts that are suitable for 24/7 operation.

Also, sad but true, PCs may sometimes require disassembly and reseating of boards, etc.

But quite frankly maybe it was just lonely and wanted some time out of the solitary confinement closet. Sometimes there's no obvious explanation.

So a few things to think about, though:

1) Labeling by device name (adaX) is really bad. Correcting that is a bit of a PITA though.

2) Actually set up some SMART tests to run! I do a short every 4 hours and a long 3 times a week on the filers here and this seems to be good at catching failing disks.

3) Consider that your disks are aging and are probably closer to the end of their service life than the beginning. You could consider this an opportunity. If your data is important, it should really be backed up, and so if that's a consideration, maybe there's an opportunity somewhere to make a new filer and then use the current one as a replication target or something like that.

4) Before you put the filer back online, be sure to run memory tests on it "just to be safe."

pcmofo · Aug 18, 2014

Thanks for the replies, guys. My motherboard does have 8 built-in SATA ports, and I am not using any PCI cards or expansion (as seen in the image). I moved the server out of the closet just in case that was an issue. It's already nearly silent, so noise isn't really a concern. There was a bit of dust on the intake filters, but I clean that out every few months anyway. It could be that dust + closet + heavy traffic for 8+ hours = some overheating, and things got wacky. I ran MemTest on it for a few days and everything checked out.

I added some SMART testing and it appears to be running fine now. I might up the tests to a more frequent schedule, though. I can say that I am already planning my next build with Supermicro. I got a quote from iXsystems and it was a bit high compared to what I think I can build myself. I'll most likely end up with a single Xeon, 32GB RAM, and 6x 4TB HDDs in a 2U case. If I get the right motherboard then I should be able to expand to more drives in the future... unless I find a good deal on a 24-drive case... A 6-drive RAIDZ2 vdev seems to be ideal for most cases.
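
For a rough sense of what the planned 6x 4TB RAIDZ2 vdev would hold, here is a back-of-the-envelope sketch. It only subtracts the two parity drives and converts units; it ignores ZFS metadata, padding and the usual advice to leave free space, so real usable capacity will be somewhat lower. The function names are made up for illustration.

```python
def raidz_data_capacity_tb(drive_count, drive_size_tb, parity=2):
    """Capacity left for data in one RAIDZ vdev, in decimal TB, before ZFS overhead."""
    if drive_count <= parity:
        raise ValueError("a RAIDZ vdev needs more drives than parity devices")
    return (drive_count - parity) * drive_size_tb


def tb_to_tib(terabytes):
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return terabytes * 10**12 / 2**40


capacity_tb = raidz_data_capacity_tb(drive_count=6, drive_size_tb=4)
print(capacity_tb, "TB, or about", round(tb_to_tib(capacity_tb), 2), "TiB")
```

With 6 drives and 2 parity devices, 4 drives' worth of space is left for data: 16 TB as the drive vendors count it, which is about 14.55 TiB as the operating system reports it.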
