HDD or Controller Failure?

Neonkore · Mar 5, 2024

Hi All,

I've been happily using TrueNAS Scale for about a month now and I believe everything was set up correctly.

I've got 4 x 16TB Ironwolf drives and 1 x Ironwolf Pro drive in a 5-disk RAIDZ1 pool, physically housed in a Silverstone FS305-12G 'caddy' and connected to the onboard SATA controller of my motherboard, a Zenith Extreme for Threadripper (1 SATA cable per drive).

The 1 Pro HDD was bought from eBay and I did not run badblocks and simply verified runtime and size matched what the label said (I was impatient, lesson learnt). A month later I am having issues with playback in Plex etc and the GUI has thrown a critical warning that I have a degraded drive (it was the eBay drive - SDA2) and consequently the pool is degraded. The drive is no longer able to run any SMART test (the tests continue to abort, and I receive critical warnings stating the drive is not capable of SMART self-check/unable to read attribute). There were a bunch of checksum errors on the drive as well. In all, I chalked it up to a fake drive and was preparing to buy a new drive until...

After restarting the NAS (cleared 2k checksum errors on SDA2) and running a scrub on the pool I am now greeted with the below:

Code:

pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1.83M in 22:17:50 with 1021 errors on Tue Mar  5 08:17:25 2024
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sdc2    DEGRADED     0     0 2.11K  too many errors
            sdb2    DEGRADED     0     0 2.11K  too many errors
            sda2    FAULTED     34     0   205  too many errors
            sde2    DEGRADED     0     0 2.11K  too many errors
            sdf2    DEGRADED     0     0 2.12K  too many errors

errors: Permanent errors have been detected in the following files:

I am now worried that there is in fact a hardware failure further up the chain, or is this typical behavior with 1 faulty HDD that is redistributing its bad data to the other drives in the pool resulting in checksum errors? Or is this a ZFS failure? I should also note I added 64GB of RAM (128GB total) and ran memtest to 400% so I'm fairly sure it's not a memory error (non-ECC).
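One way to look at the question above is to compare the per-device counters in the status output. The following is a minimal sketch, not a diagnostic tool: it parses the device rows shown above and splits the disks into those whose CKSUM count sits near the median and those that do not. The helper names (`to_count`, `parse_disks`, `split_by_cksum`), the 2% tolerance, and the decimal reading of the `K` suffix are assumptions made for illustration.

```python
import re
from statistics import median

# Device rows copied from the `zpool status` output above.
STATUS = """\
        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sdc2    DEGRADED     0     0 2.11K  too many errors
            sdb2    DEGRADED     0     0 2.11K  too many errors
            sda2    FAULTED     34     0   205  too many errors
            sde2    DEGRADED     0     0 2.11K  too many errors
            sdf2    DEGRADED     0     0 2.12K  too many errors
"""

# Assumption: suffixes are read as decimal multipliers. The exact base
# does not change the comparison between disks.
SUFFIX = {"K": 1_000, "M": 1_000_000}

# Matches only leaf devices named sdX..., skipping the pool and vdev rows.
ROW = re.compile(r"^\s+(sd\w+)\s+(\w+)\s+(\S+)\s+(\S+)\s+(\S+)")


def to_count(text):
    """Convert a counter such as '205' or '2.11K' to an integer."""
    if text[-1] in SUFFIX:
        return round(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def parse_disks(status):
    """Return {device: {state, read, write, cksum}} for each disk row."""
    disks = {}
    for line in status.splitlines():
        match = ROW.match(line)
        if match:
            name, state, read, write, cksum = match.groups()
            disks[name] = {
                "state": state,
                "read": to_count(read),
                "write": to_count(write),
                "cksum": to_count(cksum),
            }
    return disks


def split_by_cksum(disks, tolerance=0.02):
    """Split disks into those near the median CKSUM count and the outliers."""
    mid = median(d["cksum"] for d in disks.values())
    near, outliers = [], []
    for name, d in disks.items():
        target = near if abs(d["cksum"] - mid) <= tolerance * mid else outliers
        target.append(name)
    return sorted(near), sorted(outliers)


disks = parse_disks(STATUS)
near, outliers = split_by_cksum(disks)
print("similar CKSUM counts:", near)
print("outliers:", outliers)
```

With the output above, the four Ironwolf drives land in one group with almost identical counts and sda2 stands apart as the only disk with read errors. Near-identical counts suggest the errors were counted per block across the whole vdev rather than arising independently on each disk, but that pattern alone cannot tell a shared hardware fault apart from blocks that became unrecoverable once sda2 stopped providing redundancy.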

My fault finding has consisted of checking the caddy and cables. Next step would be to buy a new 16TB HDD, but I don't want to spend money on a (new!) new drive to find the onboard controller is faulty. Buying a used PCIe SATA controller would be great, but I'm already using the slot for the NIC and my other slots are blocked by water cooling components (yes, I am using an old gaming computer! I figure the MTBF for the pump says I have a lot of life left there...).

Thanks!

ChrisRJ · Mar 5, 2024

Hardware details please

Neonkore · Mar 5, 2024

Hi @ChrisRJ - the details should be in my signature, was there anything else in particular that you were looking for?

The 4 Ironwolf HDDs are the ST16000VN0001 model and the 1 Ironwolf Pro (suspected dodgy) is the ST16000NE0001 model.
TrueNAS version is TrueNAS-SCALE-23.10.2 and the pool is at 16% usage.

ChrisRJ · Mar 5, 2024

Sorry, had overlooked the signature.

I would try to eliminate potential error sources. The prime candidate for me is the Silverstone FS305-12G. Also, getting a suitable HBA (not a SATA controller) would be on my list; temporarily removing the NIC for that would be ok IMO.

First and foremost, though, I would replace the RAIDZ1 with a RAIDZ2. Yes that requires destruction of the pool and is not what you asked for. But for 16 TB drives RAIDZ1 is a relatively high risk.

Neonkore · Mar 6, 2024

Thanks for the response - yes the enclosure could be the culprit and would explain the checksum errors happening all of a sudden across all drives - I just think the 1 drive 'failing' first being the eBay drive is too coincidental.

I've picked up another 16TB drive and intend to offline the potentially bad drive and introduce the new one - I trust that once I replace the affected files from backup there's no further chance of spreading 'bad bits' when the files are redistributed across the pool? I figure this can be a relatively easy first step so long as there's no chance of damage...

I've started the hunt for a 6-bay external enclosure and a PCI HBA that I can attach with a ribbon to get out from the watercooling mess - that way I will move to RAIDZ2 without losing capacity (once I RMA the suspected drive). Any suggestions on a 'dumb' enclosure? I'm guessing the JBOD enclosures aren't recommended because they will treat the drives as a combined pool (unless I can turn that off). I'm reading your very informative guides in the background.
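The "without losing capacity" point can be checked with simple arithmetic. This is a rough sketch that assumes six 16TB drives for the planned RAIDZ2 (the four Ironwolf drives, the new drive, and the RMA replacement) and ignores padding, metadata, and TB/TiB differences; `raidz_usable_tb` is a hypothetical helper, not a ZFS API.

```python
def raidz_usable_tb(disks, disk_tb, parity):
    """Approximate usable capacity of a single RAIDZ vdev in TB.

    Simplification: usable space is (disks - parity) * disk size.
    Real pools lose a little more to padding and metadata.
    """
    if disks <= parity:
        raise ValueError("a RAIDZ vdev needs more disks than parity devices")
    return (disks - parity) * disk_tb


current = raidz_usable_tb(disks=5, disk_tb=16, parity=1)  # today's RAIDZ1
planned = raidz_usable_tb(disks=6, disk_tb=16, parity=2)  # assumed 6-wide RAIDZ2

print(f"RAIDZ1, 5 x 16TB: about {current} TB usable")
print(f"RAIDZ2, 6 x 16TB: about {planned} TB usable")
```

Under these assumptions both layouts give about 64 TB of usable space, with the sixth drive going entirely to the second parity device.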

Thanks for your help.

PhilD13 · Mar 6, 2024

I have a Supermicro JBOD chassis connected via an external-port HBA card and cable, and that works just fine as expansion. You can make vdevs from the expansion chassis drives and add them to an existing pool that was built on the main system, if you wish. You can also create separate pool(s) on the expansion chassis if you want to go that way. I also have a QNAP expansion chassis connected to a second system in a similar manner, which also works fine. To avoid any issues, make sure the JBOD chassis is powered on before booting the server.

Neonkore · Mar 7, 2024

Thanks, I'm thinking:
QNAP TL-D800S JBOD which seems mighty overpriced, so I'll hunt for something similar
External Mini SAS HD SFF-8644 to Mini SAS SFF-8088 Hybrid Cable
LSI SAS9300-8e HBA

For now, I'll replace the HDD in the current enclosure as a trial, provided there is no risk of further data corruption.

PhilD13 · Mar 7, 2024

The Supermicro JBOD chassis I have is an older chassis and is the
Supermicro CSE-836E16-R92JBD 3U 16-Bay 3.5” Server Chassis JBOD BPN-SAS2-836EL1
It is an older 16 bay chassis but it does fine for what I need it for. I think it cost about 250.00

The QNAP JBOD chassis is a REXP 12 bay I already had and re-purposed. I remember it being rather expensive for what it actually is.

There are plenty of quality used servers and JBOD chassis of various ages and configurations available that don't cost much from the various reputable server/refurbish places if you look around.
My primary server complete without drives cost around 900.00 and 16 x 8TB used drives also cost about that amount. The total was about 1000 less than I could buy a new empty 12 bay QNAP server for. That was why I went with TrueNAS and used equipment.

Neonkore · Mar 12, 2024

Resilvering the drive....how long is too long? Been running 5 days now, at 11% - estimate to complete is for another 5 weeks :D

Would it be safe to say that this is a hardware or controller issue further up the chain?

Code:

scan: resilver in progress since Thu Mar  7 22:37:45 2024
        7.21T / 11.8T scanned at 18.3M/s, 1.35T / 11.8T issued at 3.43M/s
        277G resilvered, 11.45% done, no estimated completion time
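The "another 5 weeks" figure can be reproduced from the numbers in this output. A minimal sketch, assuming the issued rate stays constant and that the T/M suffixes are binary units; `to_bytes` and `remaining_days` are hypothetical helper names.

```python
# Assumption: suffixes in the scan line are binary (powers of 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a size such as '11.8T' or '3.43M' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def remaining_days(total, issued, rate_per_second):
    """Days left if the remaining data is issued at a constant rate."""
    remaining = to_bytes(total) - to_bytes(issued)
    return remaining / to_bytes(rate_per_second) / 86_400


# Values taken from the resilver output above.
days = remaining_days(total="11.8T", issued="1.35T", rate_per_second="3.43M")
print(f"{days:.1f} days, about {days / 7:.1f} weeks")
```

At 3.43M/s the remaining 10.45T works out to roughly 37 days, which is in line with the five-week estimate and far below what these drives can sustain for sequential writes.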

ChrisRJ · Mar 12, 2024

Way too long. Is it an SMR drive?

Neonkore · Mar 12, 2024

ChrisRJ said:
Way too long. Is it an SMR drive?

Ironwolf Pro is CMR. Replacing another pro like for like.

Apollo · Mar 12, 2024

Neonkore said:
I am now worried that there is in fact a hardware failure further up the chain, or is this typical behavior with 1 faulty HDD that is redistributing its bad data to the other drives in the pool resulting in checksum errors? Or is this a ZFS failure? I should also note I added 64GB of RAM (128GB total) and ran memtest to 400% so I'm fairly sure it's not a memory error (non-ECC).

My fault finding has consisted of checking the caddy and cables. Next step would be to buy a new 16TB HDD, but I don't want to spend money on a (new!) new drive to find the onboard controller is faulty. Buying a used PCIe SATA controller would be great, but I'm already using the slot for the NIC and my other slots are blocked by water cooling components (yes, I am using an old gaming computer! I figure the MTBF for the pump says I have a lot of life left there...).

It's a bit of a waste using a Threadripper with non-ECC memory and call it a gaming computer.
If you really want to make a purchase decision and do the replacement before the resilvering has completed, I would probably consider removing the enclosure and connecting the HDDs directly to the SATA ports of the motherboard and to the PSU.
The FS305-12G doesn't allow for great airflow, and the internal heat from the HDDs could cause the enclosure and connector contacts to expand during extended workloads.
A weak power supply cable could let the failing HDD degrade the power domain, affecting the other HDDs within the enclosure.

Are you overclocking your memory? I would expect your system to crash rather than consistently generating ZFS checksum faults.
What kind of watercooling system is it, and how old is it? If it is a non-serviceable closed loop, you might be running dry. If it is a custom loop, then maybe it's time to clean the CPU block. How old is your setup anyway?

Neonkore · Mar 12, 2024

Thanks Apollo.
It was an old gaming rig from late 2017.
Memory was expanded, it's also not overclocked any longer, and memtest was run to 800% with no errors before I set up TrueNAS, so I'm fairly confident about the memory stability.
Custom loop was fully stripped & cleaned, with new coolant before starting its second life as a server. Temperatures across the CPU/GFX/water temp are fine.
I was also concerned about the airflow over the FS305-12G; however, with the single fan it seems to keep the HDD temps in the high 40s (degrees C), which I wouldn't think is a problem. Power supply and cables are all what I would call 'high quality'. You may be 100% correct on the SATA cables though: they are 90 cm long, so there may be signal issues, or there may be problems at the connection to the enclosure. However, for all drives to now be throwing checksum errors after a month says to me it probably isn't a cable issue but more of a controller issue.

I can't fit the HDD in the case due to the watercooling equipment, hence the requirement for an external enclosure. I think I'll have to bite the bullet, buy a separate 2U enclosure and an HBA, wipe the pool and take the opportunity to move to RAIDZ2 with a few more disks.

ChrisRJ · Mar 13, 2024

Neonkore said:
I can't fit the HDD in the case due to the watercooling equipment, hence the requirement for an external enclosure.

My recommendation would be to change that. You are adding potential sources of error (external enclosure and cabling) and gain nothing for the NAS. And I would assume that getting a regular size CPU fan is also cheaper.

Apollo · Mar 13, 2024

ChrisRJ said:
My recommendation would be to change that. You are adding potential sources of error (external enclosure and cabling) and gain nothing for the NAS. And I would assume that getting a regular size CPU fan is also cheaper.

Threadripper 1950X series CPU heatsink could be a problem to source as the form factor is no longer supported.

I had to replace the off-the-shelf water cooling loop on my Threadripper 2950X right before Christmas, as it was also running dry and had started causing the NVMe UEFI boot partition on Windows 11 to become corrupted. I was only able to find one water cooler to fit the sTR4 socket.
While a few brands seem to claim compatibility, it would seem the adapter needs to be purchased separately, directly from the manufacturer.
I guess I was lucky as I bought the very last water cooler from my area.

My server is based on the ASRock Taichi X399 with Threadripper 1900X. I have an Arctic Freezer 33TR to keep it cool'ish.
A better CPU heatsink if available could be used.

ChrisRJ · Mar 13, 2024

I was more thinking in the direction of a conventional cooler ...

Apollo · Mar 14, 2024

ChrisRJ said:
I was more thinking in the direction of a conventional cooler ...

The Arctic Freezer 33TR is a conventional cooler. The OP can use that as a baseline and see if a more capable cooler is available in order to account for higher power dissipation by the CPU.
