2 failed drives in my array. Seeking advice.

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
So this is a continuation of sorts of issues documented in this thread. https://www.truenas.com/community/threads/3-bad-drive-all-from-the-same-vdev-please-help.98109/

Long story short, I have a 48-drive array with 3 vdevs: 16x6TB, 16x6TB, and 16x10TB (I know I should have made smaller vdevs). One of the 10TB drives "failed", and after I replaced it with a spare and the array rebuilt, two more of the 10TB drives "failed" overnight a couple of days later, leaving my array (which is Z3, by the way) in a severely degraded state.

So I took all 3 drives to a local computer repair shop that does a lot of business with small and medium businesses, and they tested the drives and they all came up fine. Their partitions showed up and there didn't appear to be any issues with the drives.

So at this point I am not completely sure why FreeNAS kicked the drives out of the array, but a bad backplane, cable, or some other piece of bad hardware could be the issue, since this problem is only affecting the 10TB drives, which are all the same and are all in a row on the same backplane.

Others have suggested moving drives around to see if that is the issue, and that is certainly a step I plan to take. For now, though, my only concern is getting my array into a healthy state, because if I lose one more drive, I lose everything. Since these drives aren't actually bad, or at least not bad enough that I want to replace them, my questions are these.

1. What happens if I put the rejected drives back into the server and power up the system? (It has been powered off ever since, almost a year now.) Will FreeNAS automatically take the drives back into the array, or do they have to be manually reinserted?

2. Is there a way or command to force FreeNAS to take the drives back, and what is the risk in doing that? Since the drives are good and already part of the array, it seems better to just add them back in rather than reformat them and rebuild, which increases the risk of another failure.

Again, my goal here is to get my array healthy enough that I am no longer one drive away from failure. Then I will move drives around so that some of the 6TB drives sit where the 10s are and monitor the system. If any of those 6TB drives get kicked out of the array, I will know it is a hardware issue. I would still want to be able to reinsert those booted drives back into the array rather than rebuild.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
It probably will take the drives back, assuming the drives are good and the backplane is actually good. Probably better for you to just skip the backplane and connect them directly, or use another system, because you need to rule something out at this point before moving forward.

Man, good luck resilvering those drives, though. Depending on how full your pool was, 3x 10 TB drives that need resilvering will probably take days even if you do NOTHING with the pool, and it puts tremendous load on the surviving drives. Times like these are the reason I recommend people use striped mirrors. Resilvering is a breeze and the load on the pool is minimal.
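
As a rough lower bound, assuming around 150 MB/s of sustained throughput per drive (an assumption, not a measurement of your hardware): 10 TB is about 10,000,000 MB, and 10,000,000 MB / 150 MB/s is roughly 66,700 seconds, or about 18.5 hours per drive in the ideal, purely sequential case. A RAIDZ resilver on a busy 16-wide vdev is mostly small random reads across the survivors, so multiples of that, i.e. days, is realistic.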
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
It probably will take the drives back, assuming the drives are good and the backplane is actually good. Probably better for you to just skip the backplane and connect them directly, or use another system, because you need to rule something out at this point before moving forward.

Man, good luck resilvering those drives, though. Depending on how full your pool was, 3x 10 TB drives that need resilvering will probably take days even if you do NOTHING with the pool, and it puts tremendous load on the surviving drives. Times like these are the reason I recommend people use striped mirrors. Resilvering is a breeze and the load on the pool is minimal.

I don't know how I would "skip the backplane". The drives were already pulled and tested on another system and they came back as good.

As for the resilvering, that is exactly what I am hoping to avoid and what this post is about. Since the drives are good (or good enough), I want FreeNAS to just reinsert them back into the array as-is, since they already have the stripes they need on them.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Is there a way or command to force FreeNAS to take the drives back, and what is the risk in doing that? Since the drives are good and already part of the array, it seems better to just add them back in rather than reformat them and rebuild, which increases the risk of another failure.

These drives are out of sync. I don't know if you can force them back. I don't think you can, but I am not sure. The risk is that, most certainly, the data on the failed drives no longer matches what's on the healthy drives, and, during the resilver, the system already lost track of what is stored where on the failed drives.
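
Before deciding anything, it is worth seeing how ZFS currently classifies those three drives. A minimal check, assuming your pool is named "tank" (substitute your real pool name):

zpool status -v tank    # shows each vdev member as ONLINE, DEGRADED, FAULTED or UNAVAIL, plus any permanent errors

If they show as FAULTED or UNAVAIL while the rest of the raidz3 is ONLINE, re-attaching them will typically still trigger a resilver of whatever ZFS thinks changed while they were out.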
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
These drives are out of sync. I don't know if you can force them back. I don't think you can, but I am not sure. The risk is that, most certainly, the data on the failed drives no longer matches what's on the healthy drives, and, during the resilver, the system already lost track of what is stored where on the failed drives.

I don't see how the data could be out of sync. At the time the 2 "failed" drives were kicked out of the array, I was doing a scrub but no data transfers, and I cancelled the scrub and shut down the system, and it has been off ever since. So the data in the pool and on these drives hasn't changed.

Isn't there a command or something that performs the necessary checks and can then take the drives back, or can just repair and resync them if necessary, rather than do a full rebuild?
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
So can anyone tell me how to reinsert those drives back into the array? Is there a command to force FreeNAS to take the drives back, and what is the risk in doing that? Maybe @Ericloewe has some thoughts or can ping an expert?

My array is a Z3 array and I just want to make it healthier so I can figure out what the issue is. I would feel a lot better knowing that I don't have to resilver/rebuild my array every time a drive drops out.

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, zpool clear does what you want, if I understand you correctly. It's not particularly risky, but it can be, pessimistically, described as sticking your head in the sand if used improperly.
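
For reference, a minimal sketch of the syntax (the pool name "tank" and device "da20" are placeholders, not taken from your system):

zpool clear tank          # clear error counters for every device in the pool
zpool clear tank da20     # or clear counters for a single device only
zpool status -v tank      # then check how ZFS classifies the members afterwards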
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Well, zpool clear does what you want, if I understand you correctly. It's not particularly risky, but it can be, pessimistically, described as sticking your head in the sand if used improperly.

What does "used improperly" mean? If typing in the command as given, it should do what exactly?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Used blindly to pretend that there is no problem.

It clears error counters and tells ZFS to give it another spin.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
If you decide to continue using them, pay special attention to their SMART data. This assumes you have regular short and long SMART tests scheduled.
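
If they are not scheduled, a manual round looks roughly like this (the device name da20 is a placeholder; repeat per drive):

smartctl -t short /dev/da20   # quick self-test, a few minutes
smartctl -t long /dev/da20    # full surface self-test, many hours on a 10 TB drive
smartctl -a /dev/da20         # afterwards, review the attributes and the self-test log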
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
Used blindly to pretend that there is no problem.

It clears error counters and tells ZFS to give it another spin.

So does a system restart typically do that, or will the drives still be unavailable when I power the system back up?

If not, would I use that command and then restart my server to see all my drives in my pool?

What about the zpool online command?

If you decide to continue using them, pay special attention to their SMART data. This assumes you have regular short and long SMART tests scheduled.

Since the drives were kicked out of the pool and I panic shut down my server and have not used it in about a year, I did not run SMART tests or grab the SMART data. I did pull the drives, though, and will put them in another system on Monday to run those tests. What I have seen already leads me to believe the drives are actually good and that something else is going on instead, maybe a bad backplane. Once I am able to get the drives inserted back into the pool, I will shuffle drives around and see if some of my known-good 6TB Toshiba drives start dropping out from the same backplane; that would confirm what I suspect. But first I want to be able to make my pool healthy without having to resilver/rebuild whenever a drive is kicked/dropped from the pool.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I don't know how I would "skip the backplane". The drives were already pulled and tested on another system and they came back as good.
I assume you are referring to the test done by a local computer shop, mentioned in your initial post. If what you described there (partitions were visible) is all that was done, it was a useless test. Being able to see partitions has zero meaning relative to "the drive being good". But perhaps more testing was done ...
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
If you have concrete proof that the drives are good, try swapping only the PSU.
Again, if you don't share the smart data we are blind here.
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
I assume you are referring to the test done by a local computer shop, mentioned in your initial post. If what you described there (partitions were visible) is all that was done, it was a useless test. Being able to see partitions has zero meaning relative to "the drive being good". But perhaps more testing was done ...

I will take the drives in for more extensive testing on Monday and get SMART data that I can post here. It wasn't just that the partitions were available; the drives were responsive and not exhibiting (as far as they could tell) any of the noticeable signs of a dying drive.

If you have concrete proof that the drives are good, try swapping only the PSU.
Again, if you don't share the smart data we are blind here.

My server has 4 PSUs and I don't have spares.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I will take the drives in for more extensive testing on Monday and get SMART data that I can post here. It wasn't just that the partitions were available; the drives were responsive and not exhibiting (as far as they could tell) any of the noticeable signs of a dying drive.
If all these activities show positive results, that is of course better than negative ones. But in no way does it prove that the drives are error-free. ZFS is way more sensitive than other file systems, in the sense that other file systems show no error even though something is not OK, whereas ZFS reports it. And the same applies to SMART tests. All the drive failures I have had over the years happened without SMART tests giving any upfront warning.

Something is likely wrong with the drives. I know that I may seem somewhat paranoid here, but this attitude has helped me not lose data over more than 30 years. Therefore I would personally simply treat those drives as dead and get replacements.
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
@ChrisRJ @Davvo

So I got SMART data for the drives that dropped out and for the one that was already replaced in the array. They all came back as good, so I didn't bother doing a sector-by-sector test, which would have taken an hour per drive. ZFS may be more sensitive than other file systems, but my thinking is that these drives aren't in a condition that justifies being booted from the array unless there is a bigger hardware issue at play.

Anyway, here are the results. Sorry for the low quality.

Attachments: Disk 1.png, Disk 2.png, Disk 3.png
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Mmmh... this is strange.
Were they all connected to the same PSU?
Are they from the same batch?
When was the last time you did a short or long smart test on each of those drives?
I would personally attach them to the TrueNAS system again and run a long SMART test on every disk, then look at the data again.
If nothing is reported to be wrong, I would continue with the zpool clear and monitor them closely.
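
For the "monitor them closely" part, a periodic check along these lines is usually enough to catch trouble early (pool and device names are placeholders):

zpool status -x                                                     # prints "all pools are healthy" or details only for a troubled pool
smartctl -A /dev/da20 | grep -E "Reallocated|Pending|Uncorrect"     # the attributes that most often move first on a failing drive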

Edit: I would still buy replacements ASAP.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
They all came back as good, so I didn't bother doing a sector-by-sector test, which would have taken an hour per drive.
That is a bad decision. I looked at the referenced thread and this one; it sounds like you might have an HBA issue.

But let us first rule out the hard drives. I do not trust any testing that someone else did, and looking at the SMART values as you did above does not show all the data I need to see. So I'm asking you to do the testing on your computer yourself. Those results we will trust.

First you need to post a few pieces of information (in [ code ] text [ /code ] brackets):
zpool status
glabel status
smartctl -a /dev/dax where "x" is the drive number in question. List all suspect drives.

After posting the above requested data, run the SMART Long Test. Why post the information above first vice waiting for the testing results?
Because that data is critical in telling us which drives are suspect and what the pool status is, and we can look at the before-testing data now vice waiting 20 hours. Maybe the issue will be obvious right away, maybe not.

Next, run smartctl -t long /dev/dax on each suspect drive. You can run these tests on all the drives at once. Do not power off or reboot the machine during the testing or it will not complete. Wait the number of minutes required for the tests to complete (1168 minutes or 19.5 hours). Then list the SMART data again using smartctl -a /dev/dax so we can see how things worked out.
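
If you want to check on things along the way rather than wait blind, something like this works (da5 is a placeholder device name):

smartctl -a /dev/da5 | grep -A 1 "Self-test execution status"   # shows percent remaining while a test is running
smartctl -l selftest /dev/da5                                   # shows the self-test log once it has finished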

Last question... What is the manufacture date on these suspect drives?

Something else I found related to the model of your hard drive:
With older power supplies you might find the WD100EMAZ will not show up at all in your BIOS or Windows. We had this issue on our test bench and found we required the 3.3V pin mod to have the drives work. Just cover the 3rd pin with tape or a non-conductive, non-corrosive varnish to stop electrical current from passing.

So maybe you are having an issue with the drive not being recognized. Just some added information, which may not be relevant since you stated the drives were working at one point in time, but it's better to know than not know.

Post the requested information and do not cut anything out. Some people think they know better and cut out vital information when they think it's not vital. You can remove the serial number if it makes you feel better but leave everything else.
 

MeatTreats

Dabbler
Joined
Oct 23, 2021
Messages
26
@joeschmuck

Thanks for the reply. Running those commands and getting that data will require that I power up my server. It had to happen sooner or later, right?

I will have to do this tomorrow since it is late and may not have everything you want until Wednesday depending on how long these tests take. When I put the drives back into my server and power it up, are the drives going to be reinserted back into the array or is that something I have to do with something like zpool clear?

To answer your manufacture date question: 2018, but I only bought them new in 2021. They are shucked from Easystores.

As for the 3.3V power issue, I am familiar with it. I have used 3 of these drives in my PC and had to cover that pin with tape in order for my PC to even recognize them, but my server, which has 16 of them, didn't have that problem.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
When I put the drives back into my server and power it up, are the drives going to be reinserted back into the array or is that something I have to do with something like zpool clear?
Okay, so you just added some complexity. I did not realize the drives had been removed and replaced. They should not become part of the pool since they have been removed, correctly I hope. Before installing the drives, look at the "zpool status" output and make sure there are no alarms (offline or degraded type messages); everything should be Online with no errors.

After that, if you have the drive bays to install one of those drives, we can test that drive out. Then check "zpool status" again and verify it hasn't changed. If it hasn't, then all is normal. You should be able to install the other two drives and check "zpool status" again. If all is good, then perform the steps above. The "zpool status" and "glabel status" outputs were to help identify the drives in question when they were part of the pool; now it's just extra information to make sure all is good. The "smartctl" commands are the truly relevant commands.

We can do this a different way as well, and one I feel much better about. If you have the drives connected to a different computer, the smartctl commands will work from a Linux or FreeBSD boot environment. They work on a Windows computer as well, but you will need to install smartmontools (located here: https://www.smartmontools.org/ ) to perform the test, and the drive may need to be assigned a drive letter. Either way, you can run the smartctl commands and post the outputs for both the before and after results. The Windows version of the command would be smartctl -t long x: if the drive was identified as "x:".
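
A sketch of what that looks like on Windows once smartmontools is installed (the drive letter x: follows the example above; the actual name depends on how Windows enumerates the drive):

smartctl --scan          # lists the device names smartmontools can see
smartctl -a x:           # current attributes and self-test log, before testing
smartctl -t long x:      # start the long self-test
smartctl -l selftest x:  # review the self-test log after it completes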
 