All drives in pool showing FAULTED after resilvering?

BitHappy

Cadet
Joined
Jul 26, 2021
Messages
6
Hey, I'm hoping someone has some info on what could have happened here. My pool of SSDs was showing "degraded" a few days ago, so I ordered a new drive of the same size and replaced the faulted drive last night. I left it to resilver overnight, and when I checked on it again after work this evening it was still showing "degraded". When I took a closer look, it was now saying that every single drive in the pool had faulted, with something like 24 checksum errors each, which I believe is the same error and number the original drive was showing before I replaced it. :/ I've started a scrub of the pool just to make sure it's not a mistake of some sort, but it will be a while before it finishes, so I wanted to ask here in the meantime.

I mean, it's not really likely that every drive suddenly went bad, right? Should I start backing up the data right now? Can I even access the pool with that many faulted drives? I've never tried with more than one bad drive. Is there something I could have done wrong in the replacement process? Any help would be much appreciated!

Thank you!
 

BitHappy

Cadet
Joined
Jul 26, 2021
Messages
6
Oh wait, no.. after looking at it again, it turns out that all but ONE are degraded, and that one is the new drive I just put in. It says ONLINE, while the others all say DEGRADED.. Am I screwed? If I put the old bad drive back in, will everything go back to normal, or will that just make things worse? I would have assumed that the other drives don't change at all while a new drive resilvers, but maybe they do? If I put the old drive back in and restart the system, would it just recognize everything exactly how it was, or would I need to resilver all these drives now, or something weird? Is it even possible to put in new drives now that more than one is bad? Ugh, things went a lot smoother last time I replaced a drive in this pool, haha..

Speaking of which, I'm not sure this is related at all, but while I'm thinking about it: this is the second time a drive has faulted in this pool, which is made up entirely of SSDs, while my other two pools, which are all HDDs, have never had a single issue. Is that normal? I kind of expected the SSDs to last longer than the HDDs, but I suppose I never really thought it through.

PS the scrub completed and the results appear the same. :/ Here's a screenshot just in case.
Degraded_Pool_Status.png
 
Joined
Nov 25, 2022
Messages
7
With software RAID there is write amplification: when you write a block of data, parity calculations are done and the result is written across multiple disks, plus journaling and other housekeeping. If you are running SSDs, check their endurance rating, i.e. the TBW rating and how much life they have left. You will find that most software RAID systems burn through SSDs fast unless you buy high-endurance SSDs rated for storage appliances. Intel probably has the highest-endurance SSDs, but they cost a lot...
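If you want to check, something like this will show the wear counters (just a sketch; the device names are placeholders and the exact attribute names vary by vendor):

smartctl -A /dev/ada0 (on a SATA SSD, look at attributes such as Wear_Leveling_Count, Media_Wearout_Indicator, or Total_LBAs_Written)
smartctl -a /dev/nvme0 (on an NVMe SSD, compare the "Percentage Used" and "Data Units Written" fields against the drive's rated TBW)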

This is not like storing a file on a simple disk. That is why I am strongly considering TrueNAS as my backup system, primarily for the snapshot feature, and UNRAID as the main "NAS" data store.

UNRAID is fundamentally different: really, it is just a collection of loose disks with parity. It's not as robust as ZFS in some ways, but it's simpler, and the write amplification is much lower. It's basically like writing data to your C: drive, plus parity.

Also, you can set up a flash SSD pool as your write cache in UNRAID and max out your NIC with cheap consumer-grade flash, without stressing the endurance.

Also, you can spin down individual disks, and only the disk containing your files spins up as needed. So if you have 6 drives, maybe only 1 spins up. This saves a bunch of power and wear and tear. It is slower, though. You can also set up variable spin-down behaviour, so your disks stay active during high-traffic hours and spin down after 15 minutes of inactivity at night, if you set up a simple script.
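Something along these lines would do the night-time spin-down with a plain cron job and hdparm (just a rough sketch; the device names are examples, and UNRAID also has its own built-in spin-down settings):

0 23 * * * /sbin/hdparm -y /dev/sdb /dev/sdc (force the drives into standby at 23:00)
0 7 * * * /sbin/hdparm -S 180 /dev/sdb /dev/sdc (from 07:00, spin down after 15 minutes of inactivity; -S 180 means 180 x 5 seconds)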

I am really impressed with what I am reading about it. The main thing I was excited about with TrueNAS was the snapshot feature, which does work extremely well there, but everything else is a drawback for our simple use case.
 

zhe

Dabbler
Joined
Nov 28, 2022
Messages
24
1. Don't build RAID out of SSDs, except RAID0, because SSDs wear out over long-term use in a way HDDs don't.
2. Run #smartctl -H <disk> on every disk and check that they all pass.
3. I would then try #zpool clear <pool> <disk> to bring them back online, but please be careful.
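For example, something like this (the pool and device names here are only placeholders):

smartctl -H /dev/da1 (repeat for every member disk; each should report PASSED)
zpool clear tank da1 (resets the error counters on that device; running zpool clear tank with no device clears the whole pool)
zpool status -v tank (to confirm whether the devices come back ONLINE and whether any permanent errors are listed)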
 

BitHappy

Cadet
Joined
Jul 26, 2021
Messages
6
Ok, so using SSDs in RAID is a bad idea in general then, definitely good to know! That would explain why I've had a disproportionate number of faults in that pool compared to the others, but surely they can't all really have faulted at once, right?

Alright, so you're saying I should make sure every disk passes its SMART check and then just clear the errors? Is that really safe to do? Will it still tell me if there's a similar fault down the line, or something?

If I replace the drive I just bought with another new drive from a different brand and then resilver again, do you think that will make things worse?

Does the resilvering process change the other drives in some way, or does it just write data to the new drive? For instance, if I reattached the old drive, would everything go back to how it was yesterday before I swapped drives, or would I just have 6 faulted drives instead of 5?

Thanks again for everyone's help!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You have a bit too many questions for me to answer them all, but here are some answers:
  • Don't put the old SSD back in
  • As long as your pool reports zero / 0 data errors, you have not lost data
  • Re-silvering does need to read the other drives, so if it finds a problem, ZFS has to report it / them
  • The next drive to investigate / potentially replace is "sdi", the one with the read errors
  • If possible, perform a replace in place, ESPECIALLY on RAID-Z1
The last bit of advice is something that is mostly unique to ZFS. You can add the replacement drive to the NAS, and then tell ZFS to replace an existing drive. This allows both the drive being replaced and the other pool members to supply any redundancy.
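In practice that looks something like this (the pool and device names are just examples; on TrueNAS you would normally do it through the GUI, which issues the same operation underneath):

zpool replace tank da3 da7

Here da3 is the drive being replaced and da7 is the newly added replacement. While the replace resilvers, the old drive keeps contributing redundancy, and you can watch the progress with zpool status tank.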

Of course, if you don't have the ability to add an additional drive, then it's understandable that you have to pull an existing drive before putting in the replacement. This is one thing I designed into my new NAS: an extra disk slot for exactly this reason.
 