Why do my hard drives keep failing?

sgt_jamez · Jul 20, 2023

RAIDZ1 pool failure

I’ve had a problem with hard drives dying in my TrueNAS SCALE box. It’s been an ongoing problem for some time and I’ve never really known what’s caused it but now I need to get to the bottom of the issue.

I have an LSI SAS9211-8i that I flashed to IT mode (I don’t have the firmware version in front of me but I can get it if needed). That is connected to three Seagate 8TB IronWolf drives and to four mixed 1TB drives that are in a hot swap housing. The three 8TB drives were in a RAIDZ1 pool (yes I know not even close to ideal but it’s what I have to work with). The four mixed 1TB drives make up a separate RAIDZ1 pool. The cabling is one CABLEDECONN Mini SAS SFF-8087 to SATA 90 degree on the 8TB, and a Cable Matters on the 1TB. The CPU and motherboard have been changed out and the issue continues.

The actual issue is that suddenly a drive will get a bunch of errors and the array becomes degraded. Sure, it happens, right? But I have been feeding the server a pretty steady diet of these 8TB drives. A drive usually doesn’t last more than a few months. I haven’t known how to properly describe the issue and I have shied away from posting about it. Well now the worst has happened and I lost a second drive before I could deal with the first degraded drive and my pool is likely dead.

What is causing me to keep killing hard drives?
Is it the HBA?
The cabling?
Are the hard drives really dead/unreliable or is the system glitching for some reason and making the drives only seem bad?

I don’t know how to troubleshoot this but I very much need some help before I call in an airstrike on my server. For what it’s worth, I haven’t had this issue on the 1TB pool. Only the 8TB.

Here's some of the error text I received via email:

New alerts:

Device: /dev/sdc [SAT], Read SMART Error Log Failed.

Current alerts:

Device: /dev/sdb [SAT], failed to read SMART Attribute Data.
Device: /dev/sdc [SAT], failed to read SMART Attribute Data.
Pool main_vault state is UNAVAIL: One or more devices are faulted in response to IO failures.
The following devices are not healthy:
- Disk ST8000VN004-2M2101 WSD9M389 is UNAVAIL
- Disk ST8000VN004-2M2101 WSD9XXZ0 is FAULTED
Device: /dev/sdc [SAT], Read SMART Error Log Failed.

I was able to run smartctl -x /dev/sdb and also smartctl -x /dev/sdc. It returned data on both. I can post that if it helps. I don’t know what other information I need to provide, but I would ask that someone please help me get to the bottom of this as I need this machine to quit eating hard drives.

Trevor68 · Jul 20, 2023

I've had a couple of very young 4TB drives fault with a bunch of errors, pulled and replaced with new ones, which subsequently faulted the same way. I then replaced those with the first set, which have been fine ever since, so clearly there was really nothing wrong with the drives.

sgt_jamez · Jul 20, 2023

In the past when I had this issue on a single drive, I put it into a Linux box and used GParted to nuke the partition. Put it back into the TrueNAS box and it errored out again. I don't remember if it even made it through resilvering.

samarium · Jul 20, 2023

I've seen this happen to other people's drives before. I don't know why.
I would be checking SMART for clues.
I would be checking kernel logs: am I seeing actual data errors or maybe timeout errors?
If there were timeout errors in the mpt3sas driver I would try limiting concurrent I/O requests by setting the driver parameter. The default value of -1 implies 30000 from my reading of the kernel source. In other cases people have reduced it to 10000 and that seems to be working for them. YMMV.
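
To tell timeouts apart from actual data errors, a rough filter over the kernel log could look like the sketch below. The function name and the match patterns are assumptions about typical message wording, not an exhaustive list, and a timeout can also surface later as an I/O error, so treat the counts as a hint only.

```shell
# Rough sketch: count kernel log lines that look like timeouts/aborts
# versus lines that look like media or I/O errors.
# Assumption: the patterns below reflect common message wording; adjust
# them to whatever your kernel actually prints.
classify_kernel_log() {
    # Reads kernel log text on stdin, prints one summary line.
    awk '
        { line = tolower($0) }
        line ~ /timeout|timed out|task abort/ { timeouts++; next }
        line ~ /medium error|i\/o error|unrecovered read/ { media++ }
        END { printf "timeouts=%d media=%d\n", timeouts, media }
    '
}

# Example use on a live system (needs root):
#   dmesg | grep -i -E 'mpt[23]sas|sd[a-z]' | classify_kernel_log
```

Mostly timeouts would point toward the HBA, cabling or power; mostly media errors would point toward the drive itself.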

Fleshmauler · Jul 21, 2023

Did you burn-in the drive before deploying? Have you performed a burn-in test on these drives on a separate test machine post-failure to ensure it is the drives? Did you try swapping the cables (least time consuming option)?

Might take a few days, but at least if you can replicate failure on a second machine, then you'll know you just have really bad luck with drives.

//edit - Maybe it won't hurt to also provide output of sas2flash -list. Who knows, maybe you only thought you flashed it into IT mode, or maybe firmware is on something older than the 'latest' 2016 release.

sgt_jamez · Jul 21, 2023

Burn-in is not something I've ever done, but if you can direct me to what you'd consider good info on burn testing I will definitely do that.
I have swapped to new cables.
I shut the machine down, and when I started back up the pool was back. 3 drives, one OK, one DEGRADED, one FAULTED. I have a spare drive so I replaced the FAULTED drive and the pool survived a resilver. The pool is still flagged as DEGRADED since one of the drives is labeled DEGRADED. Oddly, its error count went back to zero. Most of the data in the pool is backed up but not all. So I am addressing that now. Also I am making arrangements to replace the pool with a RAIDZ2.

I admit I don't really know what I am looking for in checking smart and kernel logs. I have been looking it over and I don't know how to determine what the smoking gun entries are.
Also, samarium mentioned setting a driver parameter. How would I do that?
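
One possible way to do that, as a hedged sketch: it assumes the parameter samarium means is the mpt3sas module's `max_queue_depth` (the thread never names it) and that the SAS2008 card is being driven by mpt3sas. The function names and the .conf file name are made up for illustration.

```shell
# Sketch only. Assumption: the parameter in question is mpt3sas's
# max_queue_depth; the thread itself does not name it.
PARAM_FILE=/sys/module/mpt3sas/parameters/max_queue_depth

# Print the value the loaded driver is using, or "unknown" if mpt3sas
# is not loaded or does not expose the parameter in sysfs.
current_queue_depth() {
    file=${1:-$PARAM_FILE}
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo unknown
    fi
}

# Build the modprobe options line for a new limit; reject non-numeric input.
mpt3sas_options_line() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: mpt3sas_options_line <positive integer>" >&2
            return 1
            ;;
    esac
    printf 'options mpt3sas max_queue_depth=%s\n' "$1"
}

# Example (as root; the file name is hypothetical):
#   mpt3sas_options_line 10000 > /etc/modprobe.d/mpt3sas-queue.conf
# Module options are read when the driver loads, so a reboot is needed
# before current_queue_depth reflects the change.
```

On an appliance OS like TrueNAS SCALE, manual files under /etc/modprobe.d may not survive an update, so check how your version expects kernel module options to be set.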

//edit - sas2flash -list output:
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2008(B2)

Controller Number : 0
Controller : SAS2008(B2)
PCI Address : 00:07:00:00
SAS Address : 500605b-0-0107-5c70
NVDATA Version (Default) : 14.01.00.08
NVDATA Version (Persistent) : 14.01.00.08
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9211-8i
BIOS Version : 07.39.02.00
UEFI BSD Version : N/A
FCODE Version : N/A
Board Name : SAS9211-8i
Board Assembly : N/A
Board Tracer Number : N/A

Finished Processing Commands Successfully.
Exiting SAS2Flash.

Fleshmauler · Jul 21, 2023

A really quick & dirty burn in (please don't do this on your production system that is already experiencing issues; if you do, then at least ensure whatever drive you're doing it on isn't in any pool & go over your commands 5 times over, this is a destructive test once we get to badblocks).

#Start with fast & easy smart tests, replace '%' with relevant disk
smartctl -t short /dev/sd%

#review smart output, if you don't know how to review it provide it here
smartctl -a /dev/sd%

#If it failed a short smart test... stop wasting time, dead drive

#this'll take a hot minute
smartctl -t long /dev/sd%

smartctl -a /dev/sd%

#once again; review/provide output - if failed, stop wasting time, dead drive

#for 8TB drives this will likely eat up ~10 hours, the next part will be around 4 days; badblocks -c & -b config'd to what I find is the best use of time for 8TB drives
#using tmux for the next bit so you can actually go back & see output if you're not crazy & didn't leave shell session open for 3 days - opening new tmux session/tab per disk

tmux

badblocks -c 2048 -b 4096 -wvs /dev/sd%

#Ctrl+B then c for a new tmux window if testing multiple drives; 'tmux attach' in the shell to return to open tmux sessions. Google tmux for a fuller overview

#//edit Once badblocks has finished running the disk(s), I'd run the smart tests again & review//edit

I tried to give some minimal insight. There are much better guides than this on the site, but if you just wanted the commands with some very minimal understanding this should be of use.

Burn in takes time, but it is time well spent to make sure a drive isn't already useless prior to production or to confirm a drive faulted instead of going crazy troubleshooting everything else possible.
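
For a feel of where the "around 4 days" figure comes from, here is a back-of-the-envelope sketch. badblocks -w writes four test patterns and reads each one back, so the whole disk is covered eight times. The function name is made up, and the 200 MB/s average throughput used in the example is an assumption, not a measurement from this system.

```shell
# Rough runtime estimate for "badblocks -w" on one drive.
# -w writes 4 patterns and reads each back: 8 full passes over the disk.
# The throughput argument is an assumed average, not a measured value.
badblocks_hours() {
    bytes=$1        # drive capacity in bytes
    mb_per_s=$2     # assumed average sequential throughput in MB/s
    passes=8        # 4 patterns x (write + read-back)
    echo $(( bytes * passes / (mb_per_s * 1000000) / 3600 ))
}

# Example with the capacity smartctl reports for these drives:
#   badblocks_hours 8001563222016 200
```

With those inputs the result is 88 hours, a bit under four days; a slower real-world average pushes it past that.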

sgt_jamez · Jul 21, 2023

Fleshmauler said:
A really quick & dirty burn in (please don't do this on your production system that is already experiencing issues; if you do, then at least ensure whatever drive you're doing it on isn't in any pool & go over your commands 5 times over, this is a destructive test once we get to badblocks).

I tried to give some minimal insight. There are much better guides than this on the site, but if you just wanted the commands with some very minimal understanding this should be of use.

Burn in takes time, but it is time well spent to make sure a drive isn't already useless prior to production or to confirm a drive faulted instead of going crazy troubleshooting everything else possible.

I do try searching and I find so many things and I get rabbit-hole'd pretty fast. Thank you!

//edit short test results:
=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST8000VN004-2M2101
Serial Number: WSD9XXZ0
LU WWN Device Id: 5 000c50 0e6add43a
Firmware Version: SC60
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 21 10:48:46 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 559) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 712) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 079 064 044 Pre-fail Always - 86926093
3 Spin_Up_Time 0x0003 089 089 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 5
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 082 060 045 Pre-fail Always - 144794000
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2541
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 5
18 Head_Health 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 091 091 000 Old_age Always - 9
190 Airflow_Temperature_Cel 0x0022 069 054 040 Old_age Always - 31 (Min/Max 25/31)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 294
194 Temperature_Celsius 0x0022 031 046 000 Old_age Always - 31 (0 17 0 0 0)
195 Hardware_ECC_Recovered 0x001a 079 064 000 Old_age Always - 86926093
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 4
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 2523 (57 247 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 14405195185
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 106533502622

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2541 -
# 2 Short offline Completed without error 00% 2467 -
# 3 Extended offline Completed without error 00% 2426 -
# 4 Short offline Completed without error 00% 2299 -
# 5 Short offline Completed without error 00% 2131 -
# 6 Short offline Completed without error 00% 1963 -
# 7 Short offline Completed without error 00% 1795 -
# 8 Extended offline Completed without error 00% 1706 -
# 9 Short offline Completed without error 00% 1627 -
#10 Short offline Completed without error 00% 1459 -
#11 Short offline Completed without error 00% 1291 -
#12 Short offline Completed without error 00% 1123 -
#13 Extended offline Completed without error 00% 962 -
#14 Short offline Completed without error 00% 787 -
#15 Short offline Completed without error 00% 619 -
#16 Short offline Completed without error 00% 451 -
#17 Short offline Completed without error 00% 283 -
#18 Extended offline Completed without error 00% 242 -
#19 Short offline Completed without error 00% 115 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I see PASSED at the top and completed without error for the #1 Short Offline test. So I am not sure what to make of:
1 Raw_Read_Error_Rate 0x000f 079 064 044 Pre-fail Always - 86926093
7 Seek_Error_Rate 0x000f 082 060 045 Pre-fail Always - 144794000
195 Hardware_ECC_Recovered 0x001a 079 064 000 Old_age Always - 86926093
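
A note on reading those three numbers: on Seagate drives these raw values are commonly described as a packed 48-bit field, with the upper 16 bits holding the error count and the lower 32 bits holding the number of operations. Assuming that layout (it is community lore, not something stated in this thread), a quick split looks like this; the function name is made up.

```shell
# Sketch: split a Seagate 48-bit raw value into its two commonly
# described parts. Assumption: upper 16 bits = error count,
# lower 32 bits = operation count.
seagate_split() {
    raw=$1
    # 4294967296 is 2^32: dividing gives the upper bits, modulo the lower 32.
    echo "errors=$(( raw / 4294967296 )) operations=$(( raw % 4294967296 ))"
}

# Examples with the values above:
#   seagate_split 86926093
#   seagate_split 144794000
```

Both values are below 2^32, so under this reading the error part is 0 for attributes 1, 7 and 195, which matches the 0 that the `-v 1,raw48:54` form of the command reports later in the thread.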

Fleshmauler · Jul 21, 2023

ah, seagate; mind running
replace % with relevant

Code:

smartctl -a -v 1,raw48:54 /dev/sd% -v 7,raw48:54 -v 195,raw48:54

yeah - that'll translate the relevant fields

//edit
link for OG guide (was made for truenas core)

sgt_jamez · Jul 21, 2023

Fleshmauler said:
ah, seagate; mind running
replace % with relevant

Code:
smartctl -a -v 1,raw48:54 /dev/sd% -v 7,raw48:54 -v 195,raw48:54

yeah - that'll translate the relevant fields

//edit
link for OG guide (was made for truenas core)

Can I run that while the long test is in progress?

Is there a drive preferred over the Iron Wolf NAS models?

sgt_jamez · Jul 21, 2023

=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST8000VN004-2M2101
Serial Number: WSD9XXZ0
LU WWN Device Id: 5 000c50 0e6add43a
Firmware Version: SC60
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 21 14:04:05 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 247) Self-test routine in progress...
70% of test remaining.
Total time to complete Offline
data collection: ( 559) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 712) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 079 064 044 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 089 089 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 5
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 082 060 045 Pre-fail Always - 0
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2544
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 5
18 Head_Health 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 091 091 000 Old_age Always - 9
190 Airflow_Temperature_Cel 0x0022 059 054 040 Old_age Always - 41 (Min/Max 25/41)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 295
194 Temperature_Celsius 0x0022 041 046 000 Old_age Always - 41 (0 17 0 0 0)
195 Hardware_ECC_Recovered 0x001a 079 064 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 4
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 2526 (70 229 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 14405195185
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 106533502622

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Self-test routine in progress 70% 2544 -
# 2 Short offline Completed without error 00% 2541 -
# 3 Short offline Completed without error 00% 2467 -
# 4 Extended offline Completed without error 00% 2426 -
# 5 Short offline Completed without error 00% 2299 -
# 6 Short offline Completed without error 00% 2131 -
# 7 Short offline Completed without error 00% 1963 -
# 8 Short offline Completed without error 00% 1795 -
# 9 Extended offline Completed without error 00% 1706 -
#10 Short offline Completed without error 00% 1627 -
#11 Short offline Completed without error 00% 1459 -
#12 Short offline Completed without error 00% 1291 -
#13 Short offline Completed without error 00% 1123 -
#14 Extended offline Completed without error 00% 962 -
#15 Short offline Completed without error 00% 787 -
#16 Short offline Completed without error 00% 619 -
#17 Short offline Completed without error 00% 451 -
#18 Short offline Completed without error 00% 283 -
#19 Extended offline Completed without error 00% 242 -
#20 Short offline Completed without error 00% 115 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Davvo · Jul 21, 2023

sgt_jamez said:
Is there a drive preferred over the Iron Wolf NAS models?

They are great drives, better than WD's Red Line.
The next step would be going Enterprise drives (e.g. WD Gold).

Please, please use [CODE][/CODE] tags when posting the outputs.

What is your power supply? Check the power connections and the cables for any damage, make sure everything is correctly seated.

Proper Power Supply Sizing Guidance

I've seen about 1,000 threads like this one where people decide that they can power a dozen hard drives off a 360 watt supply. DO NOT DO THIS. I've seen another 1,000 threads where people decide to buy the cheapest power supply that they can...

www.truenas.com

sgt_jamez · Jul 21, 2023

Davvo said:
They are great drives, better than WD's Red Line.
The next step would be going Enterprise drives (e.g. WD Gold).

Please, please use [CODE][/CODE] tags when posting the outputs.

What is your power supply? Check the power connections and the cables for any damage, make sure everything is correctly seated.

It's an EVGA Semi Modular 80+ Bronze 600 watt.

Motherboard: Asrock B550 PG Velocita
CPU: AMD Ryzen 7 3700X (Noctua NH-U12S cooler)
RAM: Corsair LPX 2x32GB DDR4 3600
GPU: Asus NVIDIA GT730
Fans: Noctua 80mm x2, 120mm x5

//edit: added hardware list to my sig

Fleshmauler · Jul 21, 2023

sgt_jamez said:
Can I run that while the long test is in progress?

Is there a drive preferred over the Iron Wolf NAS models?

Yeah, it won't stop or interrupt the test; it'll just give the remaining estimated time along with all the other needful information.

There are no issues at all with Seagate - I've had good experiences with them. They just output SMART test results a bit differently & require some very minor additional command effort to have their SMART test output be in something readable. Honestly your SMART results are looking just fine to me. Same with your LSI SAS9211-8i.

Once long smart tests are done it'll be up to you if you want to continue onto the more data-destructive side of HDD testing or if you want to investigate the wires/your ram/your psu/whatever...

Once again, nothing wrong with running badblocks on a system that has data on it that you want to keep... but I'd personally just straight-up never do it. You never know when you're gonna boot-up & have the hard drive that was having issues change from 'sdc' to 'sda' without any warning & just because :)
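
Since sdX letters can move around between boots, one way to reduce the risk is to look the drive up by serial number under /dev/disk/by-id and use that path instead. A minimal sketch, with a made-up function name; the second argument only exists so the function can be pointed at another directory when trying it out.

```shell
# Sketch: list stable /dev/disk/by-id names for a given drive serial,
# so a destructive command does not depend on the current sdX letter.
disk_ids_for_serial() {
    serial=$1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*"$serial"*; do
        [ -e "$link" ] || continue      # glob matched nothing
        case "$link" in
            *-part[0-9]*) continue ;;   # skip partition entries
        esac
        printf '%s\n' "$link"
    done
}

# Example:
#   disk_ids_for_serial WSD9XXZ0
# then confirm where a printed path points before doing anything destructive:
#   readlink -f <printed path>
```

Comparing the serial on the drive label with the one in the by-id name is a cheap last check before pointing badblocks at anything.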

Davvo · Jul 21, 2023

sgt_jamez said:
A drive usually doesn’t last more than a few months.

This is not normal. Swap the cables, see if errors will come out on the 1TB drives.

CRC errors usually mean either a raid or toasty controller, a data cable issue, or a power issue.

What is the output of zpool status?
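
The drive keeps its own count of link CRC errors in SMART attribute 199 (UDMA_CRC_Error_Count), and the raw value only ever goes up. Noting it before and after a cable or power change shows whether new errors are still arriving. A small sketch for pulling one attribute's raw value out of smartctl output; the function name is made up.

```shell
# Sketch: print the raw value of one SMART attribute from
# "smartctl -A" output read on stdin. Only the first raw-value field
# (column 10) of the first matching row is printed.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Example (replace sdX with the relevant disk):
#   smartctl -A /dev/sdX | smart_raw 199    # UDMA_CRC_Error_Count
#   smartctl -A /dev/sdX | smart_raw 188    # Command_Timeout
```

If the number from attribute 199 is the same a few weeks after the swap, the old CRC errors were most likely history from the previous cabling.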

sgt_jamez · Jul 21, 2023

Fleshmauler said:
Yeah, it won't stop or interrupt the test; it'll just give the remaining estimated time along with all the other needful information.

There are no issues at all with Seagate - I've had good experiences with them. They just output SMART test results a bit differently & require some very minor additional command effort to have their SMART test output be in something readable. Honestly your SMART results are looking just fine to me. Same with your LSI SAS9211-8i.

Once long smart tests are done it'll be up to you if you want to continue onto the more data-destructive side of HDD testing or if you want to investigate the wires/your ram/your psu/whatever...

Once again, nothing wrong with running badblocks on a system that has data on it that you want to keep... but I'd personally just straight-up never do it. You never know when you're gonna boot-up & have the hard drive that was having issues change from 'sdc' to 'sda' without any warning & just because :)

I have swapped out the cables but they are nothing special. Is there a recommended brand? The cables (CABLEDECONN and Cable Matters) were bought from Amazon.

Davvo said:
This is not normal. Swap the cables, see if errors will come out on the 1TB drives.

CRC errors usually mean either a raid or toasty controller, a data cable issue, or a power issue.

What is the output of zpool status?

Code:

pool: main_vault
 state: ONLINE
  scan: resilvered 3.67T in 07:13:27 with 0 errors on Fri Jul 21 03:53:54 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        main_vault                                ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            fd4c6852-88d1-4c72-8f0e-a1dfd268cf9b  ONLINE       0     0     0
            ad3bf73f-e1fd-4456-a33c-27d7cd4e1a8e  ONLINE       0     0     0
            a024a392-428c-452f-854d-c3c8fcff905d  ONLINE       0     0     0
        cache
          1229538a-e846-4fcf-b556-9db79aa1f5db    ONLINE       0     0     0

errors: No known data errors
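
For keeping an eye on this between incidents, `zpool status -x` only reports pools that have problems. If you want just the affected rows out of a full `zpool status`, a rough filter could look like the sketch below; the function name is made up and it assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout shown above.

```shell
# Sketch: from "zpool status" text on stdin, print config rows whose
# state is not ONLINE or whose READ/WRITE/CKSUM counters are non-zero.
# Assumes the standard NAME STATE READ WRITE CKSUM column layout.
unhealthy_vdevs() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        /^errors:/ { in_cfg = 0 }
        in_cfg && NF >= 5 && ($2 != "ONLINE" || $3 + $4 + $5 > 0) {
            print $1, $2, $3, $4, $5
        }
    '
}

# Example:
#   zpool status main_vault | unhealthy_vdevs
```

No output means every device row was ONLINE with zero counters at the time of the check.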

I looked over the other drive that was coming up degraded but had no errors. I cleared the error on the pool and it's clean now. My HBA has a small 40mm Noctua fan running on the heatsink. Maybe I should re-paste the heatsink? Is there a way I can check the temp of the controller? Maybe replace the cabling again with a more reputable brand?
Is 600 watts a bit anemic for my rig?

Davvo · Jul 21, 2023

sgt_jamez said:
I have swapped out the cables but they are nothing special. Is there a recommended brand? The cables (CABLEDECONN and Cable Matters) were bought from Amazon.

Code:
  pool: main_vault
 state: ONLINE
  scan: resilvered 3.67T in 07:13:27 with 0 errors on Fri Jul 21 03:53:54 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        main_vault                                ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            fd4c6852-88d1-4c72-8f0e-a1dfd268cf9b  ONLINE       0     0     0
            ad3bf73f-e1fd-4456-a33c-27d7cd4e1a8e  ONLINE       0     0     0
            a024a392-428c-452f-854d-c3c8fcff905d  ONLINE       0     0     0
        cache
          1229538a-e846-4fcf-b556-9db79aa1f5db    ONLINE       0     0     0

errors: No known data errors

I looked over the other drive that was coming up degraded but had no errors. I cleared the error on the pool and it's clean now. My HBA has a small 40mm Noctua fan running on the heatsink. Maybe I should re-paste the heatsink? Is there a way I can check the temp of the controller? Maybe replace the cabling again with a more reputable brand?
Is 600 watts a bit anemic for my rig?

So the pool is online and healthy?
I don't think you need to re-paste the heatsink, the fan should suffice. I don't know how, or whether it's possible, to check the HBA temperature.
It's not a matter of brand, if you already changed the cables we can reasonably exclude them from the issue.
Regarding the PSU do the math, the guide I linked tells you how. Eyeballing things you should be ok, but looking at the PSU specs it comes with six SATA plugs... You have seven drives (plus the boot drive), how are you connecting them?

Fleshmauler · Jul 21, 2023

To check temps you're relying on the manufacturer of the device to have sensors on the equipment & support for commands to check those sensors; in short you're shit outa luck for the SAS9211-8i. I personally don't have fans on my SAS9211-8i, but I did have a SAS cable fail - it happens. I've also had an HDD fail & the replacement fail (that one sucked to diagnose; I won the bad luck lottery, happens).

Am I saying your cables are bad? No, not really. Just something to check if you have a spare cable. Your issue is considered weird by me at this point & I'd have no clue if it was the wires, RAM, CPU, motherboard, SAS controller, bad HDD, or PSU.

Usually, random things that don't make any sense are PSU or RAM related... but my word isn't gospel. Make the checks that you are comfortable with & make sense given your time & budget.

sgt_jamez · Jul 21, 2023

Davvo said:
So the pool is online and healthy?
I don't think you need to re-paste the heatsink, the fan should suffice. I don't know how, or whether it's possible, to check the HBA temperature.
It's not a matter of brand, if you already changed the cables we can reasonably exclude them from the issue.
Regarding the PSU do the math, the guide I linked tells you how. Eyeballing things you should be ok, but looking at the PSU specs it comes with six SATA plugs... You have seven drives (plus the boot drive), how are you connecting them?

Yes the pool is online and healthy for the time being. But this problem rears up every few months. Hopefully with the cable changed that will be the end of it.
I forgot to answer the rest of your question.
The 3-drive pool is connected directly to the PSU via the original cable from EVGA. The 4-drive pool is in a hot-swap bay with two SATA power inputs from the second original EVGA cable. That cable has an extender that powers the SSDs as well.

jgreco · Jul 21, 2023

sgt_jamez said:
I do try searching and I find so many things and I get rabbit-hole'd pretty fast. Thank you!

Just go up to the Resources section.

Building, Burn-In, and Testing your FreeNAS system

I've been meaning to post some guidance here for a while now. We frequently see people come to the forums with hardware problems that should have washed out in the system build process, but since many of the users here are DIY'ers without professional experience building servers, it goes from...

www.truenas.com
