RAIDZ1 pool failure
I’ve had a problem with hard drives dying in my TrueNAS SCALE box. It’s been an ongoing problem for some time, and I’ve never really known what’s caused it, but now I need to get to the bottom of the issue.
I have an LSI SAS9211-8i that I flashed to IT mode (I don’t have the firmware version in front of me, but I can get it if needed). It’s connected to three Seagate 8TB IronWolf drives and to four mixed 1TB drives in a hot-swap housing. The three 8TB drives were in a RAIDZ1 pool (yes, I know that’s not even close to ideal, but it’s what I have to work with). The four mixed 1TB drives make up a separate RAIDZ1 pool. The cabling is a CABLEDCONN Mini SAS SFF-8087 to SATA 90-degree cable on the 8TB drives and a Cable Matters cable on the 1TB drives. The CPU and motherboard have been changed out, and the issue continues.
The actual issue is that a drive will suddenly get a bunch of errors and the array becomes degraded. Sure, drives fail, right? But I have been feeding the server a pretty steady diet of these 8TB drives; a drive usually doesn’t last more than a few months. I haven’t known how to properly describe the issue, so I have shied away from posting about it. Well, now the worst has happened: I lost a second drive before I could deal with the first degraded one, and my pool is likely dead.
What is causing me to keep killing hard drives?
Is it the HBA?
The cabling?
Are the hard drives really dead/unreliable or is the system glitching for some reason and making the drives only seem bad?
I don’t know how to troubleshoot this, but I very much need some help before I call in an airstrike on my server. For what it’s worth, I haven’t had this issue on the 1TB pool, only the 8TB one.
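In case it helps, here is the kind of diagnostics pass I can run and post output from. This is just a sketch: the device names (/dev/sdb, /dev/sdc) and the pool name (main_vault) are taken from the alert emails below and may not match what the system currently assigns, and each command is guarded so a missing tool just skips.

```shell
#!/bin/sh
# Diagnostics sketch for the failing 8TB pool. Device names and pool name
# are assumptions based on the alert emails; adjust to match your system.

# Full SMART report for each suspect drive. Worth comparing
# Reallocated_Sector_Ct and Current_Pending_Sector (real media failure)
# against UDMA_CRC_Error_Count (CRC errors usually mean cabling, not the
# drive itself).
for dev in /dev/sdb /dev/sdc; do
    command -v smartctl >/dev/null 2>&1 && smartctl -a "$dev" || true
done

# Pool-level view of which vdevs are FAULTED/UNAVAIL and the error counts.
command -v zpool >/dev/null 2>&1 && zpool status -v main_vault || true

# HBA firmware/BIOS versions for the SAS9211-8i (covers the "firmware
# version if needed" question).
command -v sas2flash >/dev/null 2>&1 && sas2flash -list || true

# Kernel log lines mentioning the HBA driver or ATA link resets, which
# would show up if the controller or cabling is dropping the bus.
command -v dmesg >/dev/null 2>&1 && { dmesg | grep -iE 'mpt2sas|ata|reset' || true; }

echo "diagnostics pass complete"
```

The CRC-vs-reallocated distinction is the main thing I’d want eyes on, since it separates “drives really dying” from “system making them look bad.”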
Here's some of the error text I received via email:
New alerts:
- Device: /dev/sdc [SAT], Read SMART Error Log Failed.
- Device: /dev/sdb [SAT], failed to read SMART Attribute Data.
- Device: /dev/sdc [SAT], failed to read SMART Attribute Data.
- Pool main_vault state is UNAVAIL: One or more devices are faulted in response to IO failures.
The following devices are not healthy:
- Disk ST8000VN004-2M2101 WSD9M389 is UNAVAIL
- Disk ST8000VN004-2M2101 WSD9XXZ0 is FAULTED
- Device: /dev/sdc [SAT], Read SMART Error Log Failed.