SOLVED FreeNAS box, multiple failing drives, and confusion

High Voltage

Explorer
Joined
Jul 7, 2017
Messages
56
So, I still have the same FreeNAS box I built myself, and I KNOW I have failing drives left over from the original build. Back then I had no idea what I was doing (even though I believed I did, because I'd done research before building it). That ended in total pool death, partly because I didn't realize hot spares were a thing until later, and here we are.

So, I KNOW I still have failing drives from the original build, on the exact same system: same drives, same hardware, all of it. Having since learned that hot spares exist, I bought myself some time to deal with this imminent failure, and I am now officially at a loss haha. (All the data I lost originally is still very gone, so I'm in no actual danger of losing more data, only time. I want to use this time to learn for when this inevitably happens again in the future. This pool is likely going to die again, and I'm fully okay with that, since everything on it now comes from my year-outdated backup on external media from the first lost pool.)

so, here's the hardware.

dual-socket LGA 1366 motherboard with two Xeon X5660s
96GB of DDR3-1333 ECC registered memory
2x dual-port 10Gb fiber Ethernet cards - 10Gb FreeNAS-to-client connectivity
SAS HBA in IT mode, per the hardware recommendations on here (I forget offhand exactly which model it is, but I DO remember it's an LSI SAS2008 chip in IT mode)
HP SAS expander board, found on eBay after someone here on the forums was raving about them (I forget the exact model offhand, but I can easily find out since I have spares in a storage box)
all of the drives are plugged in using SAS-to-SATA fan-out cables (because I can't afford SAS drives in 3TB YET XD)

So, I have a total of 17 drives right now, after losing a handful previously, and I THINK 4 are now failing on me >.> which is up from the 3 I knew about before today and before typing up this post... but I'm not totally sure, and before I go and pull a drive that might actually be fine, I want to KNOW EXACTLY which ones to pull out of the system and recycle. I feel like I accidentally murdered about 6 working drives the last time I tried pulling some: I THOUGHT I had pulled all my dud drives, only to end up here again now.

SO THIS TIME, I want to make SURE I pull actual dud drives, and I'm also making this post to find out EXACTLY how to be 100% sure which drives to pull. I also read someone's suggestion of labeling the drives by their physical locations (I have an internal 3x5 drive cage in the server chassis itself, plus an expansion chassis, so labeling what's where will be godly going forward). And since it looks like I'm going to lose the array again as this issue keeps getting worse, I want to make sure my time and effort is worth it and I don't scrap good drives again, which unfortunately may have already happened.

As for the storage pool layout, it's nothing but mirrored vdevs, with hot spares standing by for failures.

I use this thing for a lot, more than it probably should be used for honestly. It's a storage server, a backup server, and an occasional datastore for VMware disk images (only when a VM is too large to fit on my limited internal disks, or when a VM needs far more IO than single drives can deliver). I also use it heavily for tinkering, such as Plex and other odds and ends, so the speed is a huge bonus (I occasionally do media editing directly off its storage too). But the primary requirement is redundancy and safety of the data stored on it. Capacity is also a big requirement, but of the three, capacity is the lowest on the totem pole for me.

So, in order of importance: redundancy/safety of the data -> speed and IO capacity -> raw capacity last. They're all rather important though, which is why the box can take upwards of 20-plus directly attached drives.

I should also note that, of the drives currently installed, two are 2TB WD Reds used as cache drives.
 
Joined
Oct 18, 2018
Messages
969
Hi @High Voltage, sorry to hear you're having trouble. To be honest, I'm not 100% sure what it is you're asking. What are you confused about, and what can we help you with specifically? From the sound of your problem, the best advice I can offer is to encourage better system maintenance practices so you don't end up with completely failed drives before you decide to take action.

it looks like I'm going to lose the array again
Properly done, FreeNAS should be able to protect your data barring some catastrophic event. Folks who experience frequent and frustrating data loss have either hit some catastrophe such as flood, fire, etc., or they have mismanaged their pool somewhere. There are three main steps to ensuring the integrity of your pools: the first is appropriate hardware, such as using HBAs in IT mode; the second is proper setup, such as avoiding single-disk vdevs or vdevs with less redundancy than the data warrants; and the third is maintenance.

From what I'm reading it sounds like your hardware is appropriate, and you are using mirrored vdevs, so that is likely okay.
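
For reference, here is roughly what a mirrored-vdev layout with a hot spare looks like from the command line. This is only an illustrative sketch, not your actual pool: "tank" and the da0-da4 device names are placeholders, and in practice you would build and manage this through the FreeNAS GUI rather than by hand.

# two mirrored vdevs plus one hot spare (all names are placeholders)
zpool create tank mirror da0 da1 mirror da2 da3 spare da4

# each mirror and the spare show up as separate entries under the pool
zpool status tank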

It also seems fairly likely that you skipped some of the maintenance work. Hot spares are useful, but they are not a replacement for regularly scheduled scrubs, short and long SMART tests, and reliable email notifications. In many cases you can replace a drive before it outright fails; if a disk is reporting a lot of reallocated sectors, you may want to consider replacing it before the whole pool goes down.
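
As a rough sketch of the kind of routine I mean (the pool name "tank" and the device names are placeholders; in FreeNAS you would normally schedule these through the GUI's tasks rather than run them by hand):

# run a short (or long) SMART self-test on a disk, then review the results
smartctl -t short /dev/da2
smartctl -a /dev/da2        # look at reallocated/pending sector counts and the self-test log

# kick off a scrub and check its progress
zpool scrub tank
zpool status tank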

That ended in total pool death, partly because I didn't realize hot spares were a thing until later, and here we are.
As I said, hot spares are great but they don't replace regular drive tests and monitoring. Unfortunately FreeNAS isn't a set it and forget it appliance. It is a set it and watch it to fix it when a disk starts going downhill kind of appliance.

So, I KNOW I still have failing drives from the original build, on the exact same system: same drives, same hardware, all of it.
I'm not 100% sure what you mean by this. If you mean that you rebuilt your machine and reused failing drives, that is a mistake. Your data relies on stable, reliable disks in order to stay safe; if you build a pool out of known bad disks, you're adding unnecessary risk.

I THINK 4 are now failing on me >.> which is up from the 3 I knew about before today and before typing up this post... but I'm not totally sure, and before I go and pull a drive that might actually be fine, I want to KNOW EXACTLY which ones to pull out of the system
Put simply, you've got to map each physical drive to its place in the pool. In the GUI you can click the pool and select Status, which lists the drives in the pool as /dev/{device}p{partition}, such as /dev/da2p2. Then smartctl -i /dev/da2 will give you the serial number, and if you marked your drive bays by serial number you know exactly which drives are good and which are bad. You can confirm this with zpool status, which lists every device in your pools by its gptid/{ID} designation. That ID can be matched to the rawuuid values in the output of gpart list, and the same output tells you which device it belongs to, such as /dev/da1. From there you can run smartctl -i /dev/da1 as above; a rough sketch of the whole chain is below.

So long as you have the correct S/Ns you know which drives to pull. If you did not label your drives that way you can shut the system down and check manually.
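
Strung together, the chain looks roughly like this. The pool name "tank", the example device da1, and the grep patterns are placeholders/assumptions on my part; substitute whatever your system actually shows:

# 1. list pool members by their gptid/{ID} labels
zpool status tank

# 2. match a gptid to a device: each device's block in gpart output starts with
#    "Geom name: daX" and its partitions carry rawuuid lines with the same IDs
gpart list | grep -e "Geom name" -e "rawuuid"

# 3. read the serial number off that device and compare it to your bay labels
smartctl -i /dev/da1 | grep -i serial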

I feel like I accidentally murdered about 6 working drives the last time I tried pulling some: I THOUGHT I had pulled all my dud drives, only to end up here again now.
Why do you think you pulled good drives? I would suggest never tossing a drive unless you know it is bad. Is it possible you just had more drives fail on you? If you properly label and document your drives as above, you shouldn't end up in a situation where you mix your drives up.

two are 2TB WD Reds used as cache drives
You are referring to L2ARC devices, yes? You may want to look up the recommended L2ARC size relative to the amount of RAM you have. If I recall correctly, an oversized L2ARC can actually reduce its usefulness, because its index takes up ARC memory.
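
If you want to sanity-check what those cache devices are actually doing, a couple of read-only commands can help. This is only a sketch; "tank" is a placeholder pool name, and the exact kstat names can vary between FreeBSD/FreeNAS versions:

# cache (L2ARC) devices appear in their own section with their own alloc and I/O columns
zpool iostat -v tank

# L2ARC hit/miss and size counters from the kernel stats
sysctl kstat.zfs.misc.arcstats | grep -E 'l2_(hits|misses|size)'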

Hopefully this helps some.
 

High Voltage

Explorer
Joined
Jul 7, 2017
Messages
56
Thanks for your reply, and sorry it took me so long to get back to this. I finally decided I'd had enough of these issues and went ahead with exactly the advice you gave about labeling drives and their bays. Upon doing that and investigating the "bad drives" in SeaTools... I'm left wondering if they were ever bad to begin with, or if FreeNAS somehow had confused info on them, given that all 4 came back with 100% clean bills of health, both from SeaTools and from every other HDD test I ran on them from Windows.

Regardless, all my bays are now labeled, and every drive is cataloged by serial number against its respective slot.

If things try messing up again, I'll know exactly what's happening where this time.

I didn't want to do that until now, out of fear that I'd somehow make things worse by pulling the wrong drive, since I had no documentation before this.


That took me far too long to finally do (years too long, haha).
 
Joined
Oct 18, 2018
Messages
969
I'm left wondering if they were ever bad to begin with, or if FreeNAS somehow had confused info on them, given that all 4 came back with 100% clean bills of health, both from SeaTools and from every other HDD test I ran on them from Windows.
I have no idea what SeaTools is. I also can't recommend using Windows to monitor your ZFS drives. The best way to monitor the drives in your FreeNAS box is with SMART tests and scrubs, both of which are available in FreeNAS directly.

It is probably a good idea to make sure you have regular scrubs scheduled, SMART tests scheduled, and email notifications properly configured.
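
If you want to double-check that those are actually running, here is a quick sketch (pool and device names are placeholders):

# the "scan:" line shows the date and result of the last scrub
zpool status tank

# the self-test log shows each completed short/long SMART test for a disk
smartctl -l selftest /dev/da1

# and a one-line overall health verdict
smartctl -H /dev/da1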

Happy to hear you were able to get your stuff properly labeled etc.
 

High Voltage

Explorer
Joined
Jul 7, 2017
Messages
56
I have no idea what SeaTools is. I also can't recommend using Windows to monitor your ZFS drives. The best way to monitor the drives in your FreeNAS box is with SMART tests and scrubs, both of which are available in FreeNAS directly.

It is probably a good idea to make sure you have regular scrubs scheduled, SMART tests scheduled, and email notifications properly configured.

Happy to hear you were able to get your stuff properly labeled etc.
Oh, they're Seagate drives, and SeaTools is Seagate's own drive testing program. The ZFS pool is dead while I rebuild, so if it weren't for that, I certainly wouldn't have tested them that way haha.

As for everything else, I am definitely going to make sure that happens.
 