Reboot every 2 minutes - 'Failed to fully fault in a core file segment'

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
Hi All,

Unfortunately the time has come to reach out to the community for help...

Running Truenas-12.0-U8.1
5 x 4TB HDD
1 x 120GB SSD for boot

My fileserver went down so i rebooted and got back in and while in the truenas web gui it froze and then said it had lost connection. Couple more reboots and its crashing (or turning off) within a minute or 2 of booting.

Connected each drive to another PC and SMART is showing them all as good.
Had a search and similar issues came down to either a drive issue or hardware issue (sata port / contoller etc).

Threw together another board and CPU and used different SATA cables and the problem persists.
Removing the 4TB drives and just having the boot drive and i have no issues, so the issue is within that pool.

Because it shutdowns/crashes i cant copy the error messages so i have taken photos.

-Starts with a 'Failed to fully fault in a core file segment..... for rrdcached'
-Progresses to - vm_fault: pager read error, pid 1833 (rrdcached)
-Finally a bunch of these - rrdcached plugin: Failed to connect to RRDCached at....

Then a bunch of stuff that it runs through quickly, that i don't understand, before it restarts.

Initially the web user interface was down as the LAN cable wasn't in but i managed to reset that after the first error. Even if the web user interface loads straight away the problem and errors are the same.

Attached screenshots below, but what can i do to try and repair this, not sure if i should be looking at this from a hard drive failure perspective since SMART is showing them all as good, or if this is an issue with some corrupt data.

I am in the process of checking for BIOS updates for the newer hardware just in case.

Thanks.

1.jpg 2.jpg 3.jpg 4.jpg 5.jpg
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Almost certainly a hardware defect.
 

souporman

Explorer
Joined
Feb 3, 2015
Messages
57
It feels like boot media to me. I would get rid of that boot SSD and install on something else, import your config and see if things improve.
 

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
It feels like boot media to me. I would get rid of that boot SSD and install on something else, import your config and see if things improve.

Installed on a different SSD, imported config and as soon as i connect the 4TB drives i get the same behaviour.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Your system dataset pool appears to be corrupted, and is almost certainly the cause of these panics. What do you get when you bring the drives to a different system, and run zpool import?
 

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
Your system dataset pool appears to be corrupted, and is almost certainly the cause of these panics. What do you get when you bring the drives to a different system, and run zpool import?

Code:
root@truenas[~]# zpool import
   pool: fileserver
     id: 7870758930578502552
  state: ONLINE
status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:

        fileserver                                      ONLINE
          raidz2-0                                      ONLINE
            gptid/b0e97c50-54f9-11ec-b55f-60a44c65160e  ONLINE
            gptid/b1a5e778-54f9-11ec-b55f-60a44c65160e  ONLINE
            gptid/b1fb3b0f-54f9-11ec-b55f-60a44c65160e  ONLINE
            gptid/b216c114-54f9-11ec-b55f-60a44c65160e  ONLINE
            gptid/b1ed1587-54f9-11ec-b55f-60a44c65160e  ONLINE


I assume the next step is to use the -f flag and try import the pool, but i won't jump the gun on that.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Hmm. The pool itself appears OK. Try booting your system under Ubuntu 20.04, and importing your pool there using zpool import -f -R /mnt fileserver. If that works, you can evacuate your data off the pool. Once your data is safe, you'll need to nuke the pool and start over on TrueNAS.

Do you have the results from smartctl for each disk?
 

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
Hmm. The pool itself appears OK. Try booting your system under Ubuntu 20.04, and importing your pool there using zpool import -f -R /mnt fileserver. If that works, you can evacuate your data off the pool. Once your data is safe, you'll need to nuke the pool and start over on TrueNAS.

Do you have the results from smartctl for each disk?
Thanks for that, i will give Ubuntu a crack and see how that goes.

Is there a particular command that would be best to use for smartctl, having not used it before i went with smartctl -a which provides a lot of information, but all disks are showing as 'passed' for the over all health assessment.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Please show the smartctl -a results for all 5 disks.
 

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
Please show the smartctl -a results for all 5 disks.

Results for all the disks below. Currently installing Ubuntu to try and mount the pool there.

Code:
root@truenas[~]# smartctl -a /dev/ada0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 520 Series SSDs
Device Model:     INTEL SSDSC2CW060A3
Serial Number:    CVCV153408RW060AGN
LU WWN Device Id: 5 001517 bb27ce922
Firmware Version: 400i
User Capacity:    60,022,480,896 bytes [60.0 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 21 23:37:18 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 2097) seconds.
Offline data collection
capabilities:                    (0x7f) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0021) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       912659h+09m+19.420s
 12 Power_Cycle_Count       0x0032   096   096   000    Old_age   Always       -       4973
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       4400
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x000f   120   120   050    Pre-fail  Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4400
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       264728
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       57
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       264728
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       353966
249 NAND_Writes_1GiB        0x0013   100   100   000    Pre-fail  Always       -       10642

SMART Error Log not supported

SMART Self-test Log not supported

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
root@truenas[~]# smartctl -a /dev/ada1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K3NRVPN4
LU WWN Device Id: 5 0014ee 2bcb64d76
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 22 00:02:52 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (42960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 456) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   164   164   021    Pre-fail  Always       -       6758
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       93
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       10015
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       57
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       62
194 Temperature_Celsius     0x0022   121   109   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10007         -
# 2  Short offline       Completed without error       00%     10006         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
root@truenas[~]# smartctl -a /dev/ada2
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZGY9QN9G
LU WWN Device Id: 5 000c50 0dc9dce92
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 22 00:05:27 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  581) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 601) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   044    Pre-fail  Always       -       152061278
  3 Spin_Up_Time            0x0003   094   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   045    Pre-fail  Always       -       209999741
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6918 (195 30 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       57
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   062   040    Old_age   Always       -       30 (Min/Max 29/30)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       79
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6917 (216 215 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9915773381
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       32749480850

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6910         -
# 2  Short offline       Completed without error       00%      6909         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
root@truenas[~]# smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZGY9QNR6
LU WWN Device Id: 5 000c50 0dc9dcb53
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 22 00:06:30 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  581) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 601) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   044    Pre-fail  Always       -       224250668
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   045    Pre-fail  Always       -       205277834
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6919 (144 20 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       58
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   062   040    Old_age   Always       -       31 (Min/Max 29/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       78
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6917 (229 63 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9916029556
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       32750287339

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6911         -
# 2  Short offline       Completed without error       00%      6910         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
Had to split the post because of character limit.

Code:
root@truenas[~]# smartctl -a /dev/ada4
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZGY9QNN8
LU WWN Device Id: 5 000c50 0dc9db45d
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 22 00:07:22 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  581) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 606) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   044    Pre-fail  Always       -       151781933
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   045    Pre-fail  Always       -       259959662
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6920 (162 69 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       58
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   064   040    Old_age   Always       -       27 (Min/Max 24/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       47
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       84
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6918 (123 136 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9915882997
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       32749411756

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      00%      6912         -
# 2  Short offline       Completed without error       00%      6910         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
root@truenas[~]# smartctl -a /dev/ada5
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZGY9QNQZ
LU WWN Device Id: 5 000c50 0dc9de0bb
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 22 00:08:25 2022 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  581) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 610) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   044    Pre-fail  Always       -       93218812
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   045    Pre-fail  Always       -       269799795
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6919 (114 78 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       58
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   062   040    Old_age   Always       -       29 (Min/Max 26/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       78
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6917 (25 185 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9915877459
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       32749404582

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6911         -
# 2  Short offline       Completed without error       00%      6910         -
# 3  Short offline       Completed without error       00%       141         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
Samuel, thank you for your assistance.
Looks like the Ubuntu trick will work a treat, pool mounted and i can access the folders. Currently backing up the data.

Interestingly i did a zpool status in ubuntu and got the following. Not really sure if this means anything for the drives or just confirms there has been some data corruption.

6.jpg
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Probably a bad cable / not plugged in properly on sdb2
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Probably a bad cable / not plugged in properly on sdb2
I doubt that. Since they're checksum errors, the data was delivered by the disk as good, but is actually bad. Bad cables would just (typically!) cause Read/Write errors.

Samuel, thank you for your assistance.
Looks like the Ubuntu trick will work a treat, pool mounted and i can access the folders. Currently backing up the data.

Interestingly i did a zpool status in ubuntu and got the following. Not really sure if this means anything for the drives or just confirms there has been some data corruption.

View attachment 58643
When you're done with whatever backup you're doing, definitely scrub the pool. I suspect there's more to this than we're seeing in that status.
 

cmb21

Cadet
Joined
Sep 20, 2022
Messages
8
I doubt that. Since they're checksum errors, the data was delivered by the disk as good, but is actually bad. Bad cables would just (typically!) cause Read/Write errors.


When you're done with whatever backup you're doing, definitely scrub the pool. I suspect there's more to this than we're seeing in that status.

Yeah will nuke it all and start fresh on Truenas 13, however at this point i am considering a hardware upgrade as well. This NAS was very much set and forget until i swapped out the original hard drives, so it still had the old 1150 socket H87i motherboard and 2000 series cpu.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
No need to nuke it, start by scrubbing the pool (and be sure to do so on a very recent OpenZFS, e.g. on TrueNAS 13). If all's good there, no point in nuking the pool.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I would still recommend nuking the pool. Your SMART output shows you've not been scheduling periodic SMART tests on your drives, so that SMART output could be quite old.

If instead you opt not to nuke the pool, you'd need the following steps to salvage it:
  1. zpool scrub fileserver, to see if those 2 checksum errors can be corrected.
  2. zfs destroy -R fileserver/.system to whack the problematic system dataset
If the checksum errors are in metadata, you've no choice but to nuke the pool.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
And start to run periodic smart tests.
 
Top