Checksum errors on all drives in the pool

cmh · Aug 11, 2023

Have searched, read through a couple posts that looked related, nothing seemed to match up quite right.

Problem summary:

Received single checksum error two months ago, cleared. Got it again yesterday and in the process of figuring out which drive some things were done badly, and then I got checksum errors on all five drives in the pool. Tried cleaning it and reseating all cables and now it won't boot.

Details:

I've got an old FreeNAS Mini motherboard - got it from a friend when support replaced his, he had already bought another. It's connected to two SSDs for the boot mirror and (5) 8TB Seagate Iron Wolf drives, all bought early in 2018 across the span of four months. Drives are in a SuperMicro 5 bay unit in 3x5.25" bays in an old case. Power supply was modern at the time of the build - 380W Antec.

TrueNAS Core installed on the SSDs, updated to 13.0-U5.3 last week. (I've also got a backup NAS, my previous, that I upgrade first) Main pool is a RAIDZ2.

On June 27th, I got a warning about a single checksum error. Followed the procedure and reset the count, triggered a scrub of the pool and confirmed all my drives were part of both the long and short SMART tests. All good.

Yesterday I got the warning again, checked and it was a single checksum error again. Was busy, had several other things going on, so I did the zpool clear again and didn't note which drive had the errors. (didn't note the drive in June, either - I know, dumb, but I really didn't want to have to deal with it at that moment)

Last night it failed again, but this time it was a bunch of checksum errors. Pool was still okay, everything was working. Triggered a scrub and wound up with over 3000 checksum errors on all five drives, but the scrub finished this morning, having fixed 694M out of 36TB. Grabbed a portion of the zpool status output:

Code:

scan: scrub repaired 694M in 12:13:17 with 0 errors on Fri Aug 11 07:45:10 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    sto                                             ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/d012e43e-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.14K
        gptid/d0fa3a57-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.15K
        gptid/d1dffb15-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.21K
        gptid/d2e4a62a-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.21K
        gptid/d3d3b8b0-67e7-11e8-9745-d05099c38c29  ONLINE       0     0 3.18K

Oh I should also mention that I looked at the smartctl -a output for all of those drives, and <i>all five</i> had a significant amount of errors. Raw read error rate, seek error rate, and G-Sense error rate were all concerningly high. I didn't grab that output, and like a bonehead I took the opportunity to finally install that pending iTerm update, so all my scrollbacks are lost. Hard pressed to imagine all five disks failed simultaneously, and I had it configured to run SMART tests on all disks and haven't heard a peep from that. Drives were purchased in 2018, 1 in Feb, 3 in Mar, and the last one in May, so I really doubt they're from the same batch. Why did I spread the purchases across 4 months? That's a question for past me, because present me has absolutely no idea.

At this point it's feeling like maybe a power supply issue, or a bad connection, or possible the controller on the motherboard?

So I shut it down, take it outside, blow it out (not really all that much dust despite living in the basement) and reconnect every cable. I look at the system board and nothing looks amiss, but it would have to be spewing smoke or be noticeably damaged for me to pick up on it.

Bring it back downstairs and plug it in and it starts to boot, but gets to "lo0: link state changed to UP" and that's the last thing it does. Have given it plenty of time to get past that point and so far nothing.

Anyone have any thoughts or suggestions about how I might be able to get this thing back online? I'm thinking it might be time to pony up and order up a new TrueNAS Mini, but I'd like to get this system back up and running at least until that gets here. Considering reinstalling onto a different drive and seeing if that works.

Thanks, hopefully info is as sufficient as it can be with me not having logged the data that sure would have been useful to have now.

cmh · Aug 11, 2023

Update: Unplugged both drives from the OS mirror and reinstalled on a spare drive. Got it up and running, imported the pool, and restored my config backup from the U5.3 update at the beginning of the month. System is back, showing one checksum error on second drive, but I'm super suspicious of this old system on pretty much every level now. Think I'm going to let it run and am looking at a new TrueNAS Mini with all new drives. I mean, the drives in there are from 2018, I guess they've done their job.

What is concerning is I had a mirror on the OS and both of those drives failed to boot the same way. So what's the point of a boot mirror, then, one wonders?

Davvo · Aug 11, 2023

It's unlikely that both drives failed, it's much more likely that just one did (and partially too!).

The BIOS is unable to understand which of the two is right, so it hangs on boot.

That's why mirroring the boot pool isn't worth it without going the full way.

Highly Available Boot Pool Strategy

I've pounded a few versions of this out over the years, but I hate explaining over and over. TrueNAS allows a ZFS boot mirror pool to be created to increase the reliability of the NAS. This sounds great in theory, but there's a flaw. Due to...

www.truenas.com

How are your drives connected to the motherboard? I would run a long smart test on every drive in the system.

cmh · Aug 11, 2023

Would that still be the case if I specifically chose the boot drive from the boot menu? Going off memory, I had UEFI shell, then the two individual drives, then extra stuff. (boot from LAN maybe? I mentally filtered it since it wasn't OS disks)

That's a seriously cool doc, though, thanks for sharing! Dunno as I'll go in _that_ deep since keeping the config backup handy and reinstalling will be well within my SLA for my home NAS.

Davvo · Aug 11, 2023

Manually selecting the sane boot drive should result in a successful boot.

joeschmuck · Aug 11, 2023

cmh said:
I looked at the smartctl -a output for all of those drives, and <i>all five</i> had a significant amount of errors. Raw read error rate, seek error rate, and G-Sense error rate were all concerningly high.

You should know that some drives will report high error rates but in reality the data is encoded and needs to be decoded. Odds are with the exception of the Gravity value, the other values are not of concern. However, you should post the output of the smartctl -a data for at least one drive so we can look at it to see if there is something of concern.

cmh said:
What is concerning is I had a mirror on the OS and both of those drives failed to boot the same way. So what's the point of a boot mirror, then, one wonders?

If your motherboard does not support fail over for the boot device, yea, why have mirrored boot drives. I only use a single drive and maintain a current copy of my TrueNAS configuration files so restoration is easy should I ever need it.

cmh · Aug 12, 2023

Motherboard was out of a FreeNAS Mini so I would think it supports boot fail over, but I dunno. Mind you it's an old FreeNAS Mini, it well predates the 2018 drives, but it does have IPMI so it is server-ish at least.

Current pool status after OS reload and configuration reload from Aug 2nd backup:

Code:

  pool: sto
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 694M in 12:13:17 with 0 errors on Fri Aug 11 07:45:10 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    sto                                             ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/d012e43e-67e7-11e8-9745-d05099c38c29  ONLINE       0     0     0
        gptid/d0fa3a57-67e7-11e8-9745-d05099c38c29  ONLINE       0     0     1
        gptid/d1dffb15-67e7-11e8-9745-d05099c38c29  ONLINE       0     0     1
        gptid/d2e4a62a-67e7-11e8-9745-d05099c38c29  ONLINE       0     0     1
        gptid/d3d3b8b0-67e7-11e8-9745-d05099c38c29  ONLINE       0     0     0

errors: No known data errors

The second drive had a checksum error pretty quickly, the third was later yesterday, and the fourth happened sometime before me just checking now.

Output of smartctl -a for the second drive:

Code:

2-NASTY:~$ sudo smartctl -a /dev/ada1
Password:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST8000VN0022-2EL112
Serial Number:    ZA19F5MG
LU WWN Device Id: 5 000c50 0a55b01cf
Firmware Version: SC61
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Aug 12 14:52:52 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 790) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       229669719
  3 Spin_Up_Time            0x0003   085   085   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   085   060   045    Pre-fail  Always       -       351631666
  9 Power_On_Hours          0x0032   048   048   000    Old_age   Always       -       45956
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       41
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   056   040    Old_age   Always       -       37 (Min/Max 36/38)
191 G-Sense_Error_Rate      0x0032   089   089   000    Old_age   Always       -       23416
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1332
193 Load_Cycle_Count        0x0032   097   097   000    Old_age   Always       -       6854
194 Temperature_Celsius     0x0022   037   044   000    Old_age   Always       -       37 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       229669719
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       45060h+01m+59.441s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       144830204795
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       549268089965

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     45869         -
# 2  Short offline       Completed without error       00%     45797         -
# 3  Short offline       Completed without error       00%     45701         -
# 4  Extended offline    Completed without error       00%     45697         -
# 5  Short offline       Completed without error       00%     45629         -
# 6  Short offline       Completed without error       00%     45533         -
# 7  Short offline       Completed without error       00%     45461         -
# 8  Short offline       Completed without error       00%     45365         -
# 9  Short offline       Completed without error       00%     45293         -
#10  Short offline       Completed without error       00%     45197         -
#11  Short offline       Completed without error       00%     45125         -
#12  Short offline       Completed without error       00%     45029         -
#13  Short offline       Completed without error       00%     44957         -
#14  Extended offline    Completed without error       00%     44948         -
#15  Short offline       Completed without error       00%     44861         -
#16  Short offline       Completed without error       00%     44789         -
#17  Short offline       Completed without error       00%     44693         -
#18  Short offline       Completed without error       00%     44621         -
#19  Short offline       Completed without error       00%     44525         -
#20  Short offline       Completed without error       00%     44453         -
#21  Short offline       Completed without error       00%     44357         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

joeschmuck · Aug 12, 2023

cmh said:
Motherboard was out of a FreeNAS Mini so I would think it supports boot fail over, but I dunno.

Boot failover without a true failover raid product is apt to not work. I doubt the Mini has this feature but I could be wrong. It would need a RAID card that would detect a boot failure and force the other drive to boot.

I don't like the ID 195 Hardware_ECC_Recovered value but that does not mean the drive is actually failing. The raw read and seek error rates are not an issue. The only problem I see with your hard drive is the G-Sense error rate. I don't know if the sensor is faulty, the value is totally invalid, or you have it mounted on a Po-Go stick.
G-sense error rate S.M.A.R.T. parameter indicates the number of errors caused by externally-induced shock or vibration.

Do you have this drive mounted properly? That is a very large value but you have no failing indications as far as I can see.

cmh · Aug 15, 2023

joeschmuck said:
Boot failover without a true failover raid product is apt to not work. I doubt the Mini has this feature but I could be wrong. It would need a RAID card that would detect a boot failure and force the other drive to boot.

Weird, thought I had replied, maybe I didn't actually post?

If we're relying on a boot failure, that wasn't happening - it would start the boot, just wouldn't complete it, so the BIOS would have never known to fail to the second drive. Still, trying to boot off of the drives individually I'd have thought one would work, but maybe once it's started to boot then the mirror comes into play? Much about the boot process I'm realizing I don't understand at all.

joeschmuck said:
I don't like the ID 195 Hardware_ECC_Recovered value but that does not mean the drive is actually failing. The raw read and seek error rates are not an issue. The only problem I see with your hard drive is the G-Sense error rate. I don't know if the sensor is faulty, the value is totally invalid, or you have it mounted on a Po-Go stick.
G-sense error rate S.M.A.R.T. parameter indicates the number of errors caused by externally-induced shock or vibration.

Do you have this drive mounted properly? That is a very large value but you have no failing indications as far as I can see.

I had wondered if that's what G-sense meant, but didn't look it up. That is super strange. The drives are mounted in a Supermicro 5 drive hot swap bay that fits in 3 5 1/4" drive bays in the case. Everything's bolted down and properly secured and the server lives at the bottom of my rack in the basement - sitting on a concrete floor. So unless drive gnomes are sneaking in at night and hitting my drives with hammers, there is absolutely no reason for these drives to be giving shock related errors. Even when I move the server I'll power it down first and then move it the way you would expect someone to move a server with so much important data on it.

Checked this morning and still only three drives with single checksum errors. Still not bold enough to run a full scrub and see if that stays the same.

cmh · Aug 15, 2023

Somewhat related, got the drives for the replacement Mini X, 2 SSDs and 5 WD 10T data drives. The Mini itself is scheduled for end of September delivery. Wondering if I should run some form of test on the drives just to make sure there are no issues with them so I know now vs. waiting until September.

WI_Hedgehog · Aug 15, 2023

I would guess there was a fan failure that cooked the [old] drive and spent the bearing lubricant, hence the vibrations and high error rate. To me it looks like this drive needs to be replaced ASAP.

Model Family: Seagate IronWolf
Device Model: ST8000VN0022-2EL112
Serial Number: ZA19F5MG
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 084 064 044 Pre-fail Always - 229669719
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 085 060 045 Pre-fail Always - 351631666
9 Power_On_Hours 0x0032 048 048 000 Old_age Always - 45956
190 Airflow_Temperature_Cel 0x0022 063 056 040 Old_age Always - 37 (Min/Max 36/38)
191 G-Sense_Error_Rate 0x0032 089 089 000 Old_age Always - 23416
194 Temperature_Celsius 0x0022 037 044 000 Old_age Always - 37 (0 14 0 0 0)
195 Hardware_ECC_Recovered 0x001a 004 001 000 Old_age Always - 229669719

Davvo · Aug 15, 2023

cmh said:
Somewhat related, got the drives for the replacement Mini X, 2 SSDs and 5 WD 10T data drives. The Mini itself is scheduled for end of September delivery. Wondering if I should run some form of test on the drives just to make sure there are no issues with them so I know now vs. waiting until September.

You have to run a smart long test on each drive.
You could (should?) burn-in the hard drives.

WI_Hedgehog · Aug 15, 2023

cmh said:
Somewhat related, got the drives for the replacement Mini X, 2 SSDs and 5 WD 10T data drives. The Mini itself is scheduled for end of September delivery. Wondering if I should run some form of test on the drives just to make sure there are no issues with them so I know now vs. waiting until September.

check into badblocks -w to find errors that aren't otherwise known about.

joeschmuck · Aug 15, 2023

WI_Hedgehog said:
and high error rate

The high error rate is not an issue. For many Seagate drives the actual error rate becomes valid if the number is above 4294967295, and if it is above this value you divide the error number by 4294967295 to give you the actual error rate.

229,669,719 - Both ID 1 and ID 195 values, coincidence? ID 195 falls under the same decoding.
351,631,666
the magic number is:
4,294,967,295 which equals FFFFFFFF in Hex. If the value is below FFFFFFFF then no errors.

Based on the SMART data, the drive is fine. The only concern I'd have is that the drive is old and I'd recommend you replace it before it does fail, but as of right now, you do not have any failures.

cmh said:
Still not bold enough to run a full scrub and see if that stays the same.

Run the SCRUB. If the values remain the same then you can run a zpool clear sto to clear the errors.

cmh said:
Wondering if I should run some form of test on the drives just to make sure there are no issues with them so I know now vs. waiting until September.

As already suggested for your new hard drives, badblocks. I would caution you to disconnect any drives you do not want to accidentally wipe clean and you might just want to boot up Ubuntu Live CD and run the test. Again, disconnect all but the drives you want to test.

cmh · Aug 15, 2023

joeschmuck said:
The high error rate is not an issue. For many Seagate drives the actual error rate becomes valid if the number is above 4294967295, and if it is above this value you divide the error number by 4294967295 to give you the actual error rate.

Okay so the temperature readings and such aren't any cause for concern?

joeschmuck said:
Based on the SMART data, the drive is fine. The only concern I'd have is that the drive is old and I'd recommend you replace it before it does fail, but as of right now, you do not have any failures.

Yep, five fresh drives that I'll be migrating to once the Mini shows up. I just need these drives to last until the new box comes in, I get it configured, and I get the data over there.

Plus, my backup NAS (let's call it NAS v1, it was promoted to backup when this came online) I usually only power up at the beginning of the month to get all the replications freshened. Now it's running nonstop as are the replications, so if the thing goes completely to pot I have the key data backed up pretty close to current. I'd still lose data as that pool is significantly smaller, but the really important stuff is there.

joeschmuck said:
Run the SCRUB. If the values remain the same then you can run a zpool clear sto to clear the errors.

OK. Only thing is I ran the scrub last time and that coincided with the OS getting wonky and the system's inability to boot. Could be completely unrelated, and most likely is since the scrub was run on the data pool, not the OS pool, but it still made me gunshy.

joeschmuck said:
As already suggested for your new hard drives, badblocks. I would caution you to disconnect any drives you do not want to accidentally wipe clean and you might just want to boot up Ubuntu Live CD and run the test. Again, disconnect all but the drives you want to test.

Got just the system for that. Bet a badblocks write test is going to take just a tiny bit of time on 5 10TB drives!

Is there any point doing the same for the 2 SSDs I got for the OS pool?

Thank you!

WI_Hedgehog · Aug 15, 2023

joeschmuck said:
The high error rate is not an issue. For many Seagate drives the actual error rate becomes valid if the number is above 4294967295, and if it is above this value you divide the error number by 4294967295 to give you the actual error rate.

229,669,719 - Both ID 1 and ID 195 values, coincidence? ID 195 falls under the same decoding.
351,631,666
the magic number is:
4,294,967,295 which equals FFFFFFFF in Hex. If the value is below FFFFFFFF then no errors.

Based on the SMART data, the drive is fine. The only concern I'd have is that the drive is old and I'd recommend you replace it before it does fail, but as of right now, you do not have any failures.

Uhhh, I dont know.
VALUE WORST THRESH
(decoded values) all look pretty sorry on the items I listed.

Admittedly Seagate drive raw values are a bit fun to decode (I have a few), but I don't remember running into VALUE WORST THRESH that looked that ugly. That data is on the jump drive I accidentally wiped a few weeks back so I can't say for sure. It does seem to coronation with what you said elsewhere.

joeschmuck · Aug 16, 2023

cmh said:
Okay so the temperature readings and such aren't any cause for concern?

provide the output of smartctl -x /dev/xxx and you should see the drives absolute max temp, but if you had a single spike at 57C and the drives limit to remain under warranty is 60C, then no, it's not a concern to me.

WI_Hedgehog said:
Admittedly Seagate drive raw values are a bit fun to decode (I have a few), but I don't remember running into VALUE WORST THRESH that looked that ugly. That data is on the jump drive I accidentally wiped a few weeks back so I can't say for sure. It does seem to coronation with what you said elsewhere.

Seagate confused the hell out of me when writing my little script but that is what I found out over the years. I could not locate anything on the VALUE, WORST, THRESH for these items, just the RAW data decoding. Damn Seagate.

cmh said:
Got just the system for that. Bet a badblocks write test is going to take just a tiny bit of time on 5 10TB drives!

Is there any point doing the same for the 2 SSDs I got for the OS pool?

Yes, it will take you days per drive, but know that badblocks runs either 4 or 5 different patterns, it's not done until it's done.

NO! Do not run badblocks on a SSD/NVMe. Not unless you want to prematurely wear it out. You can do a SMART Long test, that will read all the blocks to make sure the drive can read. The SSD's are for the bootpool so even if they both fail at the same time, it's not a big loss. Keep a copy of your configuration files and you will be able to easily recover for a total bootpool failure.

cmh said:
OK. Only thing is I ran the scrub last time and that coincided with the OS getting wonky and the system's inability to boot. Could be completely unrelated, and most likely is since the scrub was run on the data pool, not the OS pool, but it still made me gunshy.

That is fine, when you read the data to transfer it, the data integrity is verified and either the data will be good and transfer or you will have some data corruption. So what I'm saying is because you plan to transfer your data soon, as long as you do not have a drive failure, your data would remain as it presently is. If you have a drive failure and data that could have been repaired before, it may not be able to be repaired.

Also, a SCRUB does put extra stress on a drive, so it's your call. Damned if you do, damned if you don't.

WI_Hedgehog · Aug 16, 2023

I'm going to respectfully add a few things to what @joeschmuck said, and while my opinions are well-founded i don't consider them absolutes.

Personally, I've observed high drive temperatures conclusively correlate with premature drive failure, however my sample size isn't large enough. I found drives kept under 33C far outlast the same drives in the same system at 35+C, drives run at 42C last 5 years then develop problems, drives run at 44C last 4 years then develop problems, 45C last 3 years, 48C 2 years, 50C 1 year (roughly). The newer drives that max out at 65C don't last much longer than the old drives of max 45C (maybe 20%). That's not rigorous research, though it seems a reasonable rough guideline, which seems to corollary with your drive in my opinion. How long the high temp spike lasted would relate to how much the drive life was shortened.

I wouldn't run a scrub, I'd put as little stress on the drive as possible and get the data backed up ASAP. If it failed a scrub could be tried at that point.

It could be VALUE WORST THRESH are misleading and the drive is fine, in which case you get the data off safely and some needless worrying happened. However if they are correct and data is lost that's not good, so erring on the side of caution might be prudent. Once the drive is replaced you can hammer it with badblocks and see how it reacts (I'd do that in a test system though, not the server).

cmh · Aug 16, 2023

joeschmuck said:
provide the output of smartctl -x /dev/xxx and you should see the drives absolute max temp, but if you had a single spike at 57C and the drives limit to remain under warranty is 60C, then no, it's not a concern to me.

Made the mistake of running smartctl -x without any filtering, that was a bunch of data. Ran it again looking for `high.*temper` and that was quite a bit more usable:

Code:

===== /dev/ada0 =====
0x05  0x020  1              40  ---  Highest Temperature
0x05  0x030  1              38  ---  Highest Average Short Term Temperature
0x05  0x040  1              36  ---  Highest Average Long Term Temperature

===== /dev/ada1 =====
0x05  0x020  1              44  ---  Highest Temperature
0x05  0x030  1              41  ---  Highest Average Short Term Temperature
0x05  0x040  1              39  ---  Highest Average Long Term Temperature

===== /dev/ada2 =====
0x05  0x020  1              44  ---  Highest Temperature
0x05  0x030  1              42  ---  Highest Average Short Term Temperature
0x05  0x040  1              39  ---  Highest Average Long Term Temperature

===== /dev/ada3 =====
0x05  0x020  1              43  ---  Highest Temperature
0x05  0x030  1              41  ---  Highest Average Short Term Temperature
0x05  0x040  1              39  ---  Highest Average Long Term Temperature

===== /dev/ada4 =====
0x05  0x020  1              43  ---  Highest Temperature
0x05  0x030  1              41  ---  Highest Average Short Term Temperature
0x05  0x040  1              38  ---  Highest Average Long Term Temperature

===== /dev/ada5 =====
0x05  0x020  1              40  ---  Highest Temperature
0x05  0x030  1              -1  ---  Highest Average Short Term Temperature
0x05  0x040  1              -1  ---  Highest Average Long Term Temperature

Those numbers seem reasonable, max of 44C.

joeschmuck said:
Yes, it will take you days per drive, but know that badblocks runs either 4 or 5 different patterns, it's not done until it's done.

Well, according to shipping predictions, I've got until the end of September. Just need to reload the OS on the system that will be running the tests as right now it crashes daily, that's kinda suboptimal for a long term badblocks run.

joeschmuck said:
NO! Do not run badblocks on a SSD/NVMe. Not unless you want to prematurely wear it out. You can do a SMART Long test, that will read all the blocks to make sure the drive can read. The SSD's are for the bootpool so even if they both fail at the same time, it's not a big loss. Keep a copy of your configuration files and you will be able to easily recover for a total bootpool failure.

Yep, confirms what I thought - thank you!

joeschmuck said:
That is fine, when you read the data to transfer it, the data integrity is verified and either the data will be good and transfer or you will have some data corruption. So what I'm saying is because you plan to transfer your data soon, as long as you do not have a drive failure, your data would remain as it presently is. If you have a drive failure and data that could have been repaired before, it may not be able to be repaired.

Also, a SCRUB does put extra stress on a drive, so it's your call. Damned if you do, damned if you don't.

I guess the big question is, once I have the new one up and running, I would like to run this one as the backup (finally retire the old old one) and I guess the same "badblocks" would help me figure out which drive should be replaced for a reliable backup pool?

Thanks!

joeschmuck · Aug 16, 2023

WI_Hedgehog said:
Personally, I've observed high drive temperatures conclusively correlate with premature drive failure, however my sample size isn't large enough. I found drives kept under 33C far outlast the same drives in the same system at 35+C, drives run at 42C last 5 years then develop problems, drives run at 44C last 4 years then develop problems, 45C last 3 years, 48C 2 years, 50C 1 year (roughly). The newer drives that max out at 65C don't last much longer than the old drives of max 45C (maybe 20%). That's not rigorous research, though it seems a reasonable rough guideline, which seems to corollary with your drive in my opinion. How long the high temp spike lasted would relate to how much the drive life was shortened.

I agree the temperature seems to have an affect on longevity, that is why my drives normally run at 40-45C. It's warm upstairs in the loft. When I had the system in my previous house basement, I was in the 30's and drives were happy.

And I do not know everything, I learn new things all the time. And it's okay to agree to disagree, I'm good with that. People have opinions and everyone will not have the same opinion.

cmh said:
Those numbers seem reasonable, max of 44C.

They are very reasonable. Nothing wrong with those at all.

cmh said:
I guess the big question is, once I have the new one up and running, I would like to run this one as the backup (finally retire the old old one) and I guess the same "badblocks" would help me figure out which drive should be replaced for a reliable backup pool?

This is just my personal opinion, but I would not run badblocks on your old drives. You are apt to cause one to fail. And if that is what you want, that is fine. But this is a backup. During routine SMART tests and SCRUBs, the backup will throw alarms as needed. If you also retain the same data on the primary NAS then you have not lost any data. I'd run your drives as long as possible, who knows, I saw a fell with a drive that had over 9.4 years of power on time and it was still working good. I was surprised. So do not spend money unless you have to. Again, that is purely my opinion. I'm sure there are others here who have a completely different opinion.

Best of luck to you.

Important Announcement for the TrueNAS Community.

Checksum errors on all drives in the pool

Explorer

Explorer

MVP

Explorer

MVP

Old Man

Explorer

Old Man

Explorer

Explorer

Guru

MVP

Guru

Old Man

Explorer

Guru

Old Man

Guru

Explorer

Old Man

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Checksum errors on all drives in the pool"

Similar threads