Hi All,
I've been happily using TrueNAS SCALE for about a month now, and I believe everything was set up correctly.
I've got 4 x 16TB IronWolf drives and 1 x 16TB IronWolf Pro drive making up a five-wide RAIDZ1 pool. They're physically housed in a Silverstone FS305-12G 'caddy', which runs off the onboard SATA controller of my Zenith Extreme (Threadripper) motherboard, one SATA cable per drive.
The Pro HDD was bought on eBay, and I did not run badblocks on it; I simply verified that the runtime and capacity matched the label (I was impatient, lesson learnt). A month later I started having playback issues in Plex etc., and the GUI threw a critical warning that a drive was degraded (the eBay drive, sda2), and consequently the pool is degraded. The drive is no longer able to run any SMART test: the tests keep aborting, and I receive critical warnings that the drive is not capable of SMART self-check / unable to read SMART attributes. There were a bunch of checksum errors on the drive as well. In all, I chalked it up to a fake drive and was preparing to buy a new one until...
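For reference, this is roughly how I've been poking at the drive from the shell (just a sketch; /dev/sda is the device name on my system, so adjust for yours):

Code:
# Sketch of the SMART checks I've been running on the suspect drive.
# /dev/sda is an assumption from my layout; substitute your device.
smartctl -t short /dev/sda   # start a short self-test (these keep aborting)
smartctl -H /dev/sda         # overall health assessment
smartctl -a /dev/sda         # dump all SMART info; this is where I see the
                             # "unable to read SMART attribute" errors

The GUI's scheduled tests report the same aborts, so it doesn't look like a TrueNAS quirk.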
After restarting the NAS (which cleared the ~2k checksum errors on sda2) and running a scrub on the pool, I am now greeted with the below:
Code:
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1.83M in 22:17:50 with 1021 errors on Tue Mar  5 08:17:25 2024
config:

        NAME          STATE     READ WRITE CKSUM
        tank          DEGRADED     0     0     0
          raidz1-0    DEGRADED     0     0     0
            sdc2      DEGRADED     0     0 2.11K  too many errors
            sdb2      DEGRADED     0     0 2.11K  too many errors
            sda2      FAULTED     34     0   205  too many errors
            sde2      DEGRADED     0     0 2.11K  too many errors
            sdf2      DEGRADED     0     0 2.12K  too many errors

errors: Permanent errors have been detected in the following files:
I am now worried that there is in fact a hardware failure further up the chain. Or is this typical behaviour when one faulty HDD spreads bad data to the other drives in the pool, producing checksum errors across all of them? Or is this a ZFS failure? I should also note that I added 64GB of RAM (128GB total, non-ECC) and ran memtest to 400%, so I'm fairly sure it's not a memory error.
My fault-finding so far has consisted of checking the caddy and the cables. The next step would be to buy a new 16TB HDD, but I don't want to spend money on a (new!) new drive only to find the onboard controller is faulty. Buying a used PCIe SATA controller would be great, but I'm already using that slot for the NIC, and my other slots are blocked by water-cooling components (yes, I am using an old gaming computer! I figure the MTBF for the pump says I have a lot of life left there...).
Thanks!