FreeNAS 11.1, multiple resilvers after disk replacement

djs11 · Apr 29, 2019

I've been using FreeNAS for a little while now. Embarrassingly, this is my first post and it’s asking for help!

Hardware: HPE ProLiant MicroServer Gen8
Intel® Celeron® G1610T (2.3GHz/2-core/2MB/35W) Processor
2x 4GB-DDR3-ECC-Memory-RAM
4 x WD Red 2TB 3.5" SATA
Boot: Dual - Kingston 16GB DataTraveler SE9 USB Flash Drive
FreeNAS-11.1-U6

My pool is 4 x WD Red 2TB 3.5" SATA drives in a single RAIDZ1 vdev (raidz1-0). I have a couple of backups of the data, but not of the five jails or the couple of VMs.

Current zpool status:

Code:

[root@ds_fnas /]# zpool status -v
  pool: colossus
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 29 18:46:43 2019
        2.03T scanned at 422M/s, 1.45T issued at 301M/s, 4.96T total
        353G resilvered, 29.16% done, 0 days 03:24:00 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        colossus                                        ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/cce2612e-cd4d-11e7-b4fd-00fd45fd9c44  ONLINE       0     0     0
            gptid/200908ae-6906-11e9-bf3b-00fd45fd9c44  ONLINE       0     0     0  (resilvering)
            gptid/cc4efaf1-ceab-11e7-9303-00fd45fd9c44  ONLINE       0     0     0
            gptid/6fff97c6-ceb0-11e7-895d-00fd45fd9c44  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        colossus/pyramid:<0x218b5b>
        colossus/pyramid:<0x1b18a7>
        colossus/pyramid:<0x1727aa>

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:55 with 0 errors on Mon Apr 22 03:45:55 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0p2   ONLINE       0     0     0
            da1p2   ONLINE       0     0     0

errors: No known data errors
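
The progress figures on the scan line in the status output above can be cross-checked by hand. The sketch below is only an illustration: it assumes zpool prints sizes in binary units (powers of 1024) and that the time remaining is simply the un-issued data divided by the issue rate. The helper names `to_bytes` and `resilver_estimate` are made up for this example.

```python
import re

# Binary unit multipliers, assuming zpool prints sizes in powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a zpool-style size such as '1.45T' or '301M' to bytes."""
    match = re.fullmatch(r"([\d.]+)([KMGT])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def resilver_estimate(issued, rate, total):
    """Return (fraction_done, seconds_remaining) from the scan line figures."""
    issued_b, rate_b, total_b = to_bytes(issued), to_bytes(rate), to_bytes(total)
    return issued_b / total_b, (total_b - issued_b) / rate_b


# Figures copied from the scan line above.
done, seconds_left = resilver_estimate("1.45T", "301M", "4.96T")
hours, rem = divmod(int(seconds_left), 3600)
print(f"{done:.1%} done, about {hours}h {rem // 60:02d}m to go")
```

Under those assumptions the result lands close to the 29.16% and 03:24:00 that zpool reported, so the estimate appears to track the "issued" figure rather than the "scanned" one.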

A week or two ago ada1 reported a few unreadable (pending) sectors, which is not surprising given its age. I took the opportunity to order a replacement 4TB WD Red 3.5" drive and start replacing the disks and expanding the existing pool from 2TB to 4TB disks.

This is the SMART output, including the long test results, from before the disk was replaced.

Code:

smartctl -a /dev/ada1 | more

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4M7RFR068
LU WWN Device Id: 5 0014ee 264213f62
Firmware Version: 82.00A82
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 24 22:01:21 2019 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (26760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 270) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       86
  3 Spin_Up_Time            0x0027   172   170   021    Pre-fail  Always       -       4400
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12451
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       249
194 Temperature_Celsius     0x0022   120   102   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
ATA Error Count: 3
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 11688 hours (487 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 b8 96 41  Error: UNC at LBA = 0x0196b880 = 26654848

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 b8 96 41 08  38d+13:40:21.647  READ DMA
  c8 00 00 28 b8 96 41 08  38d+13:40:18.251  READ DMA
  c8 00 00 28 b8 96 41 08  38d+13:40:14.873  READ DMA

Error 2 occurred at disk power-on lifetime: 11688 hours (487 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 b8 96 41  Error: UNC at LBA = 0x0196b880 = 26654848

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 b8 96 41 08  38d+13:40:18.251  READ DMA
  c8 00 00 28 b8 96 41 08  38d+13:40:14.873  READ DMA

Error 1 occurred at disk power-on lifetime: 11688 hours (487 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 b8 96 41  Error: UNC at LBA = 0x0196b880 = 26654848

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 b8 96 41 08  38d+13:40:14.873  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     12451         26641600
# 2  Short offline       Completed without error       00%     12434         -
# 3  Extended offline    Completed: read failure       90%     12403         26644912

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@ds_fnas ~]#
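
The LBA that smartctl prints next to each error can be rebuilt from the register row itself. The following is a minimal sketch, assuming the standard 28-bit addressing used by the READ DMA (c8) command, where SN, CL and CH carry the low 24 bits and the low nibble of DH carries the top four. The function names are invented for this illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from the ATA task-file registers.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27 (the high nibble carries flags).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def parse_error_registers(line):
    """Parse an 'ER ST SC SN CL CH DH' row and return the decoded LBA."""
    fields = line.split()[:7]
    er, st, sc, sn, cl, ch, dh = (int(f, 16) for f in fields)
    return lba28_from_registers(sn, cl, ch, dh)


# Register row copied from the ada1 error log above.
row = "40 51 00 80 b8 96 41 Error: UNC at LBA = 0x0196b880 = 26654848"
lba = parse_error_registers(row)
print(hex(lba), lba)
```

The decoded value matches the LBA smartctl printed for ada1. All three logged errors share that LBA, which suggests repeated retries of one unreadable sector rather than three separate bad spots.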

Before my replacement 4TB drive arrived and I could swap out the faulty ada1, ada2 inevitably began reporting unreadable (pending) sectors as well. It was only a couple initially.

This is the SMART output for ada2, including its long test results.

Code:


smartctl -a /dev/ada2 | more

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4M5YA2KX7
LU WWN Device Id: 5 0014ee 2642140ed
Firmware Version: 82.00A82
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Apr 24 22:05:15 2019 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (27720) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 280) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       154
  3 Spin_Up_Time            0x0027   175   173   021    Pre-fail  Always       -       4241
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12543
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       280
194 Temperature_Celsius     0x0022   120   103   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15

SMART Error Log Version: 1
ATA Error Count: 7 (device log contains only the most recent five errors)
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 7 occurred at disk power-on lifetime: 11781 hours (490 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 20 26 a1 40  Error: UNC at LBA = 0x00a12620 = 10561056

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 48 25 a1 40 08  38d+16:07:59.958  READ DMA
  c8 00 00 48 25 a1 40 08  38d+16:07:56.586  READ DMA

Error 6 occurred at disk power-on lifetime: 11781 hours (490 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 20 26 a1 40  Error: UNC at LBA = 0x00a12620 = 10561056

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 48 25 a1 40 08  38d+16:07:56.586  READ DMA

Error 5 occurred at disk power-on lifetime: 7681 hours (320 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 02 a1 40  Error: UNC at LBA = 0x00a10230 = 10551856

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 60 01 a1 40 08   7d+20:46:24.631  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:20.657  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:16.687  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:12.709  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:08.612  READ DMA

Error 4 occurred at disk power-on lifetime: 7681 hours (320 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 02 a1 40  Error: UNC at LBA = 0x00a10230 = 10551856

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 60 01 a1 40 08   7d+20:46:20.657  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:16.687  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:12.709  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:08.612  READ DMA

Error 3 occurred at disk power-on lifetime: 7681 hours (320 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 02 a1 40  Error: UNC at LBA = 0x00a10230 = 10551856

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 60 01 a1 40 08   7d+20:46:16.687  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:12.709  READ DMA
  c8 00 00 60 01 a1 40 08   7d+20:46:08.612  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12524         -
# 2  Extended offline    Completed: read failure       90%     11894         10545296
# 3  Short offline       Aborted by host               90%        90         -
# 4  Short offline       Aborted by host               90%        88         -
# 5  Short offline       Aborted by host               90%        68         -
# 6  Extended offline    Interrupted (host reset)      50%        68         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
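
With two disks to compare, the attribute table can be checked mechanically. This sketch is illustrative only: the watch list (IDs 5, 197 and 198) and the "raw value above zero" rule are assumptions, not a vendor standard, and `parse_attributes` and `media_warnings` are hypothetical helper names.

```python
# Attribute IDs commonly watched for failing media; this short list and
# the "raw value above zero" rule are assumptions, not a vendor standard.
WATCHED = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector",
           198: "Offline_Uncorrectable"}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have ten columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[int(fields[0])] = (fields[1], int(fields[9]))
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attrs


def media_warnings(attrs):
    """Return the watched attributes whose raw value is non-zero."""
    return {attrs[i][0]: attrs[i][1] for i in WATCHED
            if i in attrs and attrs[i][1] > 0}


# Rows copied from the ada2 report above.
sample = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""
warnings = media_warnings(parse_attributes(sample))
print(warnings)
```

For the ada2 rows only Current_Pending_Sector is flagged, even though the overall health self-assessment still reads PASSED.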

I started replacing ada1 via the GUI, following the manual. All went as planned and the resilver got under way.

During the resilver the errors from ada2 increased, there were a few tens of read errors but a few hundred checksum errors.

After the resilver completed, all data, jails and VMs seemed okay. I was concerned about the checksum errors, and reading around I saw it mentioned that I should run a scrub on the pool. Once the scrub completed, the pool started to resilver again. That resilver also completed, with a few more errors on ada2. I left the setup for 24 hours and then rebooted the FreeNAS box; on boot the resilver started once again. I took the opportunity to shut the FreeNAS box down, check the internal cabling as best I could, and reseat the disks and RAM, then powered it back up to let the resilver continue.

This is the current error on ada2: Device: /dev/ada2, 48 Currently unreadable (pending) sectors

After running zpool status -v I saw that three files had permanent errors. As they were not important I removed them, and they are now listed as:

errors: Permanent errors have been detected in the following files:

colossus/pyramid:<0x218b5b>
colossus/pyramid:<0x1b18a7>
colossus/pyramid:<0x1727aa>
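
Entries in this form name a dataset and a hexadecimal object number instead of a path. As far as I understand it, ZFS prints them this way when it can no longer resolve the damaged object to a file name, which would be consistent with the files having been deleted. A small sketch for pulling those entries out of saved status text follows; the function name `unresolved_objects` is invented for this example.

```python
import re

# Matches entries such as 'colossus/pyramid:<0x218b5b>', where the part in
# angle brackets is an object number rather than a file path.
OBJECT_ENTRY = re.compile(
    r"^\s*(?P<dataset>[^\s:]+):<0x(?P<obj>[0-9a-fA-F]+)>\s*$"
)


def unresolved_objects(status_text):
    """Return (dataset, object number) pairs listed without a file path."""
    found = []
    for line in status_text.splitlines():
        match = OBJECT_ENTRY.match(line)
        if match:
            found.append((match.group("dataset"), int(match.group("obj"), 16)))
    return found


# Entries copied from the zpool status output above.
status = """\
errors: Permanent errors have been detected in the following files:

        colossus/pyramid:<0x218b5b>
        colossus/pyramid:<0x1b18a7>
        colossus/pyramid:<0x1727aa>
"""
for dataset, obj in unresolved_objects(status):
    print(f"{dataset}: object {obj} ({obj:#x})")
```

Saving the list before and after each scrub or resilver makes it easy to see whether the same three objects are still reported.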

Should these disappear on reboot or after the current resilver has completed? Could these have been causing the resilver not to complete as they were corrupt?

I have a second 4TB disk ready to replace ada2, but I’d like some guidance, if at all possible, on when it would be best to do this. I feel the resilver may not be completing, and I could introduce data loss if it is not done correctly?

Time for RAIDZ2 now I have bigger disks?! Data-wise I'm okay; it's the jails and VMs I'd need to work out how best to migrate.

Thanks

SweetAndLow · Apr 29, 2019

You have metadata corruption. You are going to have to restore from backup. That's exactly what zpool status told you.

pro lamer · Apr 29, 2019

Additionally:

Have you burnt in your new disks?

Do they consume much more energy than the old ones?

djs11 said:
checksum errors

Maybe cable problems?

We have some resources about hard drive ~~errors debugging/identifying~~ troubleshooting...

Sent from my phone

djs11 · Apr 29, 2019

SweetAndLow said:
You have metadata corruption. You are going to have to restore from backup. That's exactly what zpool status told you.

Okay thanks, I assume this would be after remaking the pool?

pro lamer said:
Additionally:

Have you burnt in your new disks?

Do they consume much more energy than the old ones?

Maybe cable problems?

We have some resources about hard drive errors debugging/identifying...

Sent from my phone

Only have one of the 4TB disks installed at the moment, may have them all in there sooner if I end up remaking the pool. Once I do I should be able to tell about their power vs the older disks.

Have checked the cables as best as I could, will have another look at them and search more on the hard drive errors and debugging.
Thanks

Chris Moore · Apr 29, 2019

djs11 said:
During the resilver the errors from ada2 increased, there were a few tens of read errors but a few hundred checksum errors.

Have you been monitoring the health of your drives?

djs11 said:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     12451         26641600
# 2  Short offline       Completed without error       00%     12434         -
# 3  Extended offline    Completed: read failure       90%     12403         26644912

These test results indicate that long tests failed at least twice during the reporting period, but no additional results were listed.
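
Counting the failures in a self-test log can be scripted. This is a rough sketch that assumes the row layout shown in the quoted log (number, description, status, remaining percentage, lifetime hours, first-error LBA). The regular expression and the `failed_tests` helper are illustrative and not part of smartmontools.

```python
import re

# One row of the SMART self-test log: number, description, status,
# remaining %, lifetime hours, and first-error LBA ('-' when none).
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>\S+ offline)\s+(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)$"
)


def failed_tests(log_text):
    """Return (description, lifetime hours, LBA or None) for read failures."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match and "read failure" in match.group("status"):
            lba = match.group("lba")
            failures.append((match.group("desc"), int(match.group("hours")),
                             None if lba == "-" else int(lba)))
    return failures


# Rows copied from the quoted ada1 self-test log.
log = """\
# 1 Extended offline Completed: read failure 90% 12451 26641600
# 2 Short offline Completed without error 00% 12434 -
# 3 Extended offline Completed: read failure 90% 12403 26644912
"""
failures = failed_tests(log)
for desc, hours, lba in failures:
    print(f"{desc} test failed at {hours} power-on hours, first bad LBA {lba}")
```

On the quoted log this reports the two extended-test failures, 48 power-on hours apart, at nearby but different LBAs.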

In the future, please use code tags for your formatted text, like this [CODE] with your text inside[/CODE] so the formatting of your listings is preserved like this:

Code:

smartctl -x /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    xxxxxxxx
LU WWN Device Id: 5 000c50 087b9ba8c
Firmware Version: CC54
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 29 17:27:23 2019 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  128) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 506) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   114   099   006    -    79794952
  3 Spin_Up_Time            PO----   091   091   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    36
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   077   060   030    -    56502090
  9 Power_On_Hours          -O--CK   084   084   000    -    14584
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    36
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0 0 0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   063   055   045    -    37 (Min/Max 32/40)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    29
193 Load_Cycle_Count        -O--CK   100   100   000    -    209
194 Temperature_Celsius     -O---K   037   045   000    -    37 (0 23 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    14561h+28m+24.532s
241 Total_LBAs_Written      ------   100   253   000    -    55204558150
242 Total_LBAs_Read         ------   100   253   000    -    1148809242217
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O   1223  Current Device Internal Status Data log
0x25       GPL     R/O   1223  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS     129  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5176  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL,SL  VS      10  Device vendor specific log
0xc4       GPL,SL  VS       5  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     14567         -
# 2  Short offline       Completed without error       00%     14543         -
# 3  Short offline       Completed without error       00%     14519         -
# 4  Extended offline    Completed without error       00%     14487         -
# 5  Short offline       Completed without error       00%     14471         -
# 6  Extended offline    Completed without error       00%     14439         -
# 7  Short offline       Completed without error       00%     14423         -
# 8  Short offline       Completed without error       00%     14399         -
# 9  Short offline       Completed without error       00%     14375         -
#10  Short offline       Completed without error       00%     14351         -
#11  Extended offline    Completed without error       00%     14319         -
#12  Short offline       Completed without error       00%     14303         -
#13  Extended offline    Completed without error       00%     14271         -
#14  Short offline       Completed without error       00%     14255         -
#15  Short offline       Completed without error       00%     14231         -
#16  Short offline       Completed without error       00%     14207         -
#17  Short offline       Completed without error       00%     14183         -
#18  Extended offline    Completed without error       00%     14151         -
#19  Short offline       Completed without error       00%     14135         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    37 Celsius
Power Cycle Min/Max Temperature:     32/40 Celsius
Lifetime    Min/Max Temperature:     23/45 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              36  ---  Lifetime Power-On Resets
0x01  0x010  4           14584  ---  Power-on Hours
0x01  0x018  6     55512419084  ---  Logical Sectors Written
0x01  0x020  6       477772466  ---  Number of Write Commands
0x01  0x028  6    171464011046  ---  Logical Sectors Read
0x01  0x030  6      1138650378  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           14584  ---  Spindle Motor Power-on Hours
0x03  0x010  4            9716  ---  Head Flying Hours
0x03  0x018  4             209  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Proper text formatting makes the results much easier to read.

Chris Moore · Apr 29, 2019

djs11 said:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12524         -
# 2  Extended offline    Completed: read failure       90%     11894         10545296
# 3  Short offline       Aborted by host               90%        90         -
# 4  Short offline       Aborted by host               90%        88         -
# 5  Short offline       Aborted by host               90%        68         -
# 6  Extended offline    Interrupted (host reset)      50%        68         -

These drives both have a massive number of errors and should have been replaced long ago if you were monitoring them.
Please review these guides for future reference:

Useful Commands
https://forums.freenas.org/index.php?threads/useful-commands.30314/#post-195192

Hard Drive Troubleshooting Guide (All Versions of FreeNAS)
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/

If you have not previously done so, you need to get your system configured to allow you to connect by SSH from a terminal program such as PuTTY.

You should also implement some monitoring of your system health, and here are some scripts to help with that:

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

It would appear that your system may never have been properly configured or tested. Here is a link to a very good setup guide:

Uncle Fester's Basic FreeNAS Configuration Guide
https://www.familybrown.org/dokuwiki/doku.php?id=fester:intro
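
As a rough illustration of the kind of check such monitoring performs (a minimal sketch, not taken from the linked scripts): the Python below scans the attribute table that `smartctl -A` prints and reports watched attributes whose raw value is non-zero. The function name `flag_attributes` and the choice of IDs 5, 197 and 198 are assumptions for illustration only; the sample rows are copied from the ada2 report posted in this thread.

```python
# Hypothetical helper, not part of FreeNAS or smartmontools.
# Assumption: IDs 5 (reallocated), 197 (pending) and 198 (offline
# uncorrectable) are the attributes worth flagging.
WATCHED_IDS = {5, 197, 198}


def flag_attributes(smartctl_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the ada2 attribute table in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       48
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print(flag_attributes(SAMPLE))
```

On these sample rows the only attribute reported is Current_Pending_Sector, with a raw value of 48.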

pro lamer · Apr 29, 2019

djs11 said:
Once I do I should be able to tell about their power vs the older disks.

I meant you might have needed to check whether you overloaded the PSU or not. ~~Still pay attention when attaching more and more drives (if you plan to)~~ - I am not aware of the PSU you have.
Edited because I think I was paranoid. The chassis has a limited number of bays and, unless you tweaked your PSU, it should work. Lots of people must be using this type of server and... I might have thought your PSU might have been old...


djs11 · Apr 30, 2019

Chris Moore said:
Have you been monitoring the health of your drives?

These test results indicate you had long test failures at least two times during the reporting period, but there were no additional results listed.

(I've tidied up my opening post, thanks for the feedback.)

I had SMART tests set up to run every few weeks. I must admit I was relying on the email notifications of errors rather than reviewing the outputs. From my understanding I acted as soon as the errors were reported, but I've clearly missed something.

Thanks for the links, very helpful. I'll get some of those in place to help.

pro lamer, I totally misunderstood what you were saying, apologies. I will look at burning the disks in to make sure they are suitable.

I'm about to look at the setup again, as the resilver has completed and I'd set off a SMART long test on ada2 to see the latest results.

djs11 · Apr 30, 2019

Looking at the setup now, it seems a lot healthier. My next thought would be to power cycle the server to see if the resilver starts again after the restart; if it doesn't, I'd start the replacement of ada2. Would this be the best approach?

Current pool status:

Code:

[root@ds_fnas /]# zpool status -v                                                                                                   
  pool: colossus                                                                                                                   
 state: ONLINE                                                                                                                     
  scan: resilvered 1.09T in 0 days 06:37:22 with 0 errors on Tue Apr 30 01:24:05 2019                                               
config:                                                                                                                             
                                                                                                                                    
        NAME                                            STATE     READ WRITE CKSUM                                                 
        colossus                                        ONLINE       0     0     0                                                 
          raidz1-0                                      ONLINE       0     0     0                                                 
            gptid/cce2612e-cd4d-11e7-b4fd-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/200908ae-6906-11e9-bf3b-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/cc4efaf1-ceab-11e7-9303-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/6fff97c6-ceb0-11e7-895d-00fd45fd9c44  ONLINE       0     0     0                                                 
                                                                                                                                    
errors: No known data errors                                                                                                       
                                                                                                                                    
  pool: freenas-boot                                                                                                               
 state: ONLINE                                                                                                                     
  scan: scrub repaired 0 in 0 days 00:00:48 with 0 errors on Tue Apr 30 03:45:48 2019                                               
config:                                                                                                                             
                                                                                                                                    
        NAME        STATE     READ WRITE CKSUM                                                                                     
        freenas-boot  ONLINE       0     0     0                                                                                   
          mirror-0  ONLINE       0     0     0                                                                                     
            da0p2   ONLINE       0     0     0                                                                                     
            da1p2   ONLINE       0     0     0                                                                                     
                                                                                                                                    
errors: No known data errors
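
To check whether a resilver restarts after the power cycle, the `scan:` line of each pool is the part to compare. A minimal sketch, assuming the text of `zpool status` has been captured into a string; the function names are hypothetical, and the prefix match on "resilver in progress" is based only on the wording seen in the outputs in this thread.

```python
def scan_states(zpool_status_text):
    """Map each pool name to the text of its 'scan:' line."""
    states = {}
    pool = None
    for line in zpool_status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            pool = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("scan:") and pool is not None:
            states[pool] = stripped.split(":", 1)[1].strip()
    return states


def resilvering_pools(zpool_status_text):
    """Return the pools whose scan line says a resilver is still running."""
    return [
        name
        for name, scan in scan_states(zpool_status_text).items()
        if scan.startswith("resilver in progress")
    ]


# Lines copied from the outputs in this thread.
AFTER = """\
  pool: colossus
 state: ONLINE
  scan: resilvered 1.09T in 0 days 06:37:22 with 0 errors on Tue Apr 30 01:24:05 2019
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:48 with 0 errors on Tue Apr 30 03:45:48 2019
"""
DURING = """\
  pool: colossus
 state: ONLINE
  scan: resilver in progress since Mon Apr 29 18:46:43 2019
"""

print(resilvering_pools(AFTER))
print(resilvering_pools(DURING))
```

With the status above the first call returns an empty list, while the status from my opening post would list colossus.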

djs11 · Apr 30, 2019

Latest long SMART test of ada2:

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org                                                         
                                                                                                                                    
=== START OF INFORMATION SECTION ===                                                                                               
Model Family:     Western Digital Red                                                                                               
Device Model:     WDC WD20EFRX-68EUZN0                                                                                             
Serial Number:    WD-WCC4M5YA2KX7                                                                                                   
LU WWN Device Id: 5 0014ee 2642140ed                                                                                               
Firmware Version: 82.00A82                                                                                                         
User Capacity:    2,000,398,934,016 bytes [2.00 TB]                                                                                 
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                           
Rotation Rate:    5400 rpm                                                                                                         
Device is:        In smartctl database [for details use: -P show]                                                                   
ATA Version is:   ACS-2 (minor revision not indicated)                                                                             
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)                                                                           
Local Time is:    Tue Apr 30 21:19:41 2019 BST                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled                                                                                                           
                                                                                                                                    
=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                    
General SMART Values:                                                                                                               
Offline data collection status:  (0x00) Offline data collection activity                                                           
                                        was never started.                                                                         
                                        Auto Offline Data Collection: Disabled.                                                     
Self-test execution status:      ( 121) The previous self-test completed having                                                     
                                        the read element of the test failed.                                                       
Total time to complete Offline                                                                                                     
data collection:                (27720) seconds.                                                                                   
Offline data collection                                                                                                             
capabilities:                    (0x7b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                         
                                        command.                                                                                   
                                        Offline surface scan supported.                                                             
                                        Self-test supported.                                                                       
                                        Conveyance Self-test supported.                                                             
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                             
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 280) minutes.                                                                                   
Conveyance self-test routine                                                                                                       
recommended polling time:        (   5) minutes.                                                                                   
SCT capabilities:              (0x703d) SCT Status supported.                                                                       
                                        SCT Error Recovery Control supported.                                                       
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                   
                                                                                                                                    
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                   
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1312                                         
  3 Spin_Up_Time            0x0027   177   173   021    Pre-fail  Always       -       4150                                         
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       41                                           
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                           
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12686                                       
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                           
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                           
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25                                           
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9                                           
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       285                                         
194 Temperature_Celsius     0x0022   118   103   000    Old_age   Always       -       29                                           
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       48                                           
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                           
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       22                                           
                                                                                                                                    
SMART Error Log Version: 1                                                                                                         
ATA Error Count: 177 (device log contains only the most recent five errors)                                                         
        CR = Command Register [HEX]                                                                                                 
        FR = Features Register [HEX]                                                                                               
        SC = Sector Count Register [HEX]                                                                                           
        SN = Sector Number Register [HEX]                                                                                           
        CL = Cylinder Low Register [HEX]                                                                                           
        CH = Cylinder High Register [HEX]                                                                                           
        DH = Device/Head Register [HEX]                                                                                             
        DC = Device Command Register [HEX]                                                                                         
        ER = Error register [HEX]                                                                                                   
        ST = Status register [HEX]                                                                                                 
Powered_Up_Time is measured from power on, and printed as                                                                           
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,                                                                               
SS=sec, and sss=millisec. It "wraps" after 49.710 days.                                                                             
                                                                                                                                    
Error 177 occurred at disk power-on lifetime: 12630 hours (526 days + 6 hours)                                                     
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 58 10 65 a1 40  Error: UNC 88 sectors at LBA = 0x00a16510 = 10577168                                                       
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  c8 00 58 08 65 a1 40 08      20:21:19.829  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:16.444  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:13.058  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:09.673  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:06.298  READ DMA                                                                               
                                                                                                                                    
Error 176 occurred at disk power-on lifetime: 12630 hours (526 days + 6 hours)                                                     
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 58 10 65 a1 40  Error: UNC 88 sectors at LBA = 0x00a16510 = 10577168                                                       
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  c8 00 58 08 65 a1 40 08      20:21:16.444  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:13.058  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:09.673  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:06.298  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:21:01.092  READ DMA                                                                               
                                                                                                                                    
Error 175 occurred at disk power-on lifetime: 12630 hours (526 days + 6 hours)                                                     
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 58 10 65 a1 40  Error: UNC 88 sectors at LBA = 0x00a16510 = 10577168                                                       
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  c8 00 58 08 65 a1 40 08      20:21:13.058  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:09.673  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:06.298  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:21:01.092  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:20:56.308  READ DMA                                                                               
                                                                                                                                    
Error 174 occurred at disk power-on lifetime: 12630 hours (526 days + 6 hours)                                                     
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 58 10 65 a1 40  Error: UNC 88 sectors at LBA = 0x00a16510 = 10577168                                                       
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  c8 00 58 08 65 a1 40 08      20:21:09.673  READ DMA                                                                               
  c8 00 58 08 65 a1 40 08      20:21:06.298  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:21:01.092  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:20:56.308  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:20:52.323  READ DMA                                                                               
                                                                                                                                    
Error 173 occurred at disk power-on lifetime: 12630 hours (526 days + 6 hours)                                                     
  When the command that caused the error occurred, the device was active or idle.                                                   
                                                                                                                                    
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 58 10 65 a1 40  Error: UNC 88 sectors at LBA = 0x00a16510 = 10577168                                                       
                                                                                                                                    
  Commands leading to the command that caused the error were:                                                                       
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                   
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                   
  c8 00 58 08 65 a1 40 08      20:21:06.298  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:21:01.092  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:20:56.308  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:20:52.323  READ DMA                                                                               
  c8 00 58 b0 64 a1 40 08      20:20:47.961  READ DMA                                                                               
                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                     
# 1  Extended offline    Completed: read failure       90%     12671         10535208                                               
# 2  Short offline       Completed without error       00%     12668         -                                                     
# 3  Extended offline    Completed: read failure       40%     12665         10535208                                               
# 4  Extended offline    Completed: read failure       50%     12662         10535208                                               
# 5  Extended offline    Completed: read failure       90%     12659         10535208                                               
# 6  Extended offline    Completed: read failure       80%     12658         10535208                                           
# 7  Extended offline    Completed: read failure       70%     12657         10535208                                               
# 8  Extended offline    Completed: read failure       90%     12655         10535208                                               
# 9  Extended offline    Completed: read failure       90%     12653         10535208                                               
#10  Extended offline    Completed: read failure       90%     12653         10535208                                               
#11  Extended offline    Completed: read failure       90%     12651         10535208                                               
#12  Extended offline    Completed: read failure       50%     12651         10535208                                               
#13  Short offline       Completed without error       00%     12648         -                                                     
#14  Extended offline    Completed: read failure       90%     12643         10535208                                               
#15  Extended offline    Completed: read failure       90%     12642         10535208                                               
#16  Extended offline    Completed: read failure       90%     12641         10535208                                               
#17  Short offline       Completed without error       00%     12620         -                                                     
#18  Short offline       Completed without error       00%     12596         -                                                     
#19  Short offline       Completed without error       00%     12572         -                                                     
#20  Short offline       Completed without error       00%     12548         -                                                     
#21  Short offline       Completed without error       00%     12524         -                                                     
                                                                                                                                    
SMART Selective self-test log data structure revision number 1                                                                     
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.
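
For anyone reading along, the self-test log above can be summarized with a few lines of Python. This is only a rough sketch: the sample rows are copied from the log, while the regular expression and the `parse_selftest_log` and `summarize` helpers are made up for illustration and are not part of smartctl. It counts how many tests ended in a read failure and collects the LBA each one stopped at.

```python
import re
from collections import Counter

# Sample rows copied from the self-test log above (trailing padding trimmed).
SELFTEST_LOG = """\
# 1  Extended offline    Completed: read failure       90%     12671         10535208
# 2  Short offline       Completed without error       00%     12668         -
# 3  Extended offline    Completed: read failure       40%     12665         10535208
#17  Short offline       Completed without error       00%     12620         -
"""

# Hypothetical pattern for one row; it only knows the two test types seen above.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>Short offline|Extended offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def parse_selftest_log(text):
    """Turn each recognizable log row into a dict; skip anything else."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue
        lba = m.group("lba")
        rows.append({
            "num": int(m.group("num")),
            "test": m.group("test"),
            "status": m.group("status"),
            "hours": int(m.group("hours")),
            "lba": None if lba == "-" else int(lba),
        })
    return rows


def summarize(rows):
    """Count read failures and list the distinct LBAs where they occurred."""
    failed = [r for r in rows if "read failure" in r["status"]]
    return {
        "total": len(rows),
        "read_failures": len(failed),
        "failing_lbas": sorted({r["lba"] for r in failed}),
        "failures_by_test": dict(Counter(r["test"] for r in failed)),
    }


summary = summarize(parse_selftest_log(SELFTEST_LOG))
print(summary)
```

On the full log the same pattern shows up: every extended test stops at LBA 10535208, while the short tests complete without error.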

pro lamer · Apr 30, 2019

djs11 said:
colossus/pyramid:<0x218b5b>
colossus/pyramid:<0x1b18a7>
colossus/pyramid:<0x1727aa>

Do you have any idea why they could have disappeared?

I've seen a case on our forums where a pool had metadata corruption. IIRC things seemed to go well for a while, but the pool kept being unstable and finally had to be re-created anyway... But maybe your errors stopped being permanent because something changed? But what? Do you have any clue?

Sent from my phone

djs11 · Apr 30, 2019

The only active steps I’ve taken were to delete the three files that the previous pool status reported as having permanent errors. They were then listed as you’ve quoted. I then powered down the server, reseated the disks and RAM, and checked the cables.
Could they have caused the resilver not to complete? I guess the test will be the next reboot, which I’ll be able to do in a couple of hours.

djs11 · Apr 30, 2019

After a couple of reboots everything looks as it should. The critical error is: May 1, 2019, 7:25 a.m. - Device: /dev/ada2, 48 Currently unreadable (pending) sectors

Current pool status:

Code:

[root@ds_fnas ~]# zpool status -v                                                                                                   
  pool: colossus                                                                                                                   
 state: ONLINE                                                                                                                     
  scan: resilvered 1.09T in 0 days 06:37:22 with 0 errors on Tue Apr 30 01:24:05 2019                                               
config:                                                                                                                             
                                                                                                                                    
        NAME                                            STATE     READ WRITE CKSUM                                                 
        colossus                                        ONLINE       0     0     0                                                 
          raidz1-0                                      ONLINE       0     0     0                                                 
            gptid/cce2612e-cd4d-11e7-b4fd-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/200908ae-6906-11e9-bf3b-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/cc4efaf1-ceab-11e7-9303-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/6fff97c6-ceb0-11e7-895d-00fd45fd9c44  ONLINE       0     0     0                                                 
                                                                                                                                    
errors: No known data errors                                                                                                       
                                                                                                                                    
  pool: freenas-boot                                                                                                               
 state: ONLINE                                                                                                                     
  scan: scrub repaired 0 in 0 days 00:00:48 with 0 errors on Tue Apr 30 03:45:48 2019                                               
config:                                                                                                                             
                                                                                                                                    
        NAME        STATE     READ WRITE CKSUM                                                                                     
        freenas-boot  ONLINE       0     0     0                                                                                   
          mirror-0  ONLINE       0     0     0                                                                                     
            da0p2   ONLINE       0     0     0                                                                                     
            da1p2   ONLINE       0     0     0                                                                                     
                                                                                                                                    
errors: No known data errors                                                                                                       
[root@ds_fnas ~]#

As a next step I'm going to mark ada2 for replacement via the GUI, then shut down, remove ada2, and insert another replacement 4TB disk so the resilver can start.

djs11 · May 1, 2019

Disk replacement done, resilver started.

Current pool status:

Code:

[root@ds_fnas ~]# zpool status -v                                                                                                   
  pool: colossus                                                                                                                   
 state: ONLINE                                                                                                                     
status: One or more devices is currently being resilvered.  The pool will                                                           
        continue to function, possibly in a degraded state.                                                                         
action: Wait for the resilver to complete.                                                                                         
  scan: resilver in progress since Wed May  1 08:09:50 2019                                                                         
        758G scanned at 499M/s, 61.0G issued at 370M/s, 4.67T total                                                                 
        15.2G resilvered, 1.27% done, 0 days 03:38:10 to go                                                                         
config:                                                                                                                             
                                                                                                                                    
        NAME                                            STATE     READ WRITE CKSUM                                                 
        colossus                                        ONLINE       0     0     0                                                 
          raidz1-0                                      ONLINE       0     0     0                                                 
            gptid/cce2612e-cd4d-11e7-b4fd-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/200908ae-6906-11e9-bf3b-00fd45fd9c44  ONLINE       0     0     0                                                 
            gptid/0c34239d-6be0-11e9-9314-00fd45fd9c44  ONLINE       0     0     0  (resilvering)                                   
            gptid/6fff97c6-ceb0-11e7-895d-00fd45fd9c44  ONLINE       0     0     0                                                 
                                                                                                                                    
errors: No known data errors                                                                                                       
                                                                                                                                    
  pool: freenas-boot                                                                                                               
 state: ONLINE                                                                                                                     
  scan: scrub repaired 0 in 0 days 00:00:48 with 0 errors on Tue Apr 30 03:45:48 2019                                               
config:                                                                                                                             
                                                                                                                                    
        NAME        STATE     READ WRITE CKSUM                                                                                     
        freenas-boot  ONLINE       0     0     0                                                                                   
          mirror-0  ONLINE       0     0     0                                                                                     
            da0p2   ONLINE       0     0     0                                                                                     
            da1p2   ONLINE       0     0     0                                                                                     
                                                                                                                                    
errors: No known data errors                                                                                                       
[root@ds_fnas ~]#
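
As a sanity check on the progress figures, here is a small Python sketch that recomputes the percentage and time remaining from the scan line above. It assumes the sizes are binary units (1T = 1024G) and that "done" is issued divided by total; the `to_bytes` and `resilver_progress` helpers are hypothetical names, and since the printed values are rounded the results only match approximately.

```python
import re

# ASSUMPTION: zpool prints sizes in powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Scan line copied from the zpool status output above.
SCAN_LINE = "758G scanned at 499M/s, 61.0G issued at 370M/s, 4.67T total"

PATTERN = re.compile(
    r"(?P<issued>[\d.]+[KMGT]) issued at "
    r"(?P<rate>[\d.]+[KMGT])/s, "
    r"(?P<total>[\d.]+[KMGT]) total"
)


def to_bytes(value):
    """Convert a string such as '61.0G' into a byte count."""
    m = re.fullmatch(r"([\d.]+)([KMGT])", value)
    if m is None:
        raise ValueError(f"unrecognized size: {value!r}")
    return float(m.group(1)) * UNITS[m.group(2)]


def resilver_progress(line):
    """Return (percent done, estimated seconds remaining) for a scan line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a resilver progress line")
    issued = to_bytes(m.group("issued"))
    rate = to_bytes(m.group("rate"))
    total = to_bytes(m.group("total"))
    percent = 100.0 * issued / total
    eta_seconds = (total - issued) / rate
    return percent, eta_seconds


percent, eta = resilver_progress(SCAN_LINE)
hours, rest = divmod(int(eta), 3600)
print(f"{percent:.2f}% done, about {hours}h {rest // 60}m to go")
```

Under those assumptions the result lands close to the reported 1.27% done and 03:38:10 to go, which suggests the estimate is driven by the "issued" figure rather than the "scanned" one.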

pro lamer · May 1, 2019

djs11 said:
delete the three files that the previous pool status reported as having permanent errors. They were then listed as you’ve quoted,

Just another wild idea/doubt: what happens when/after metadata files are deleted?

Sent from my phone

djs11 · May 1, 2019

The files I deleted weren’t metadata files, they were files from the volume: one was a .pdf, another a .txt and the last an .m4v.

The resilver has completed and the pool is reported as healthy. Are there any additional checks I can run?
I’m about to start a set of long SMART tests on all the drives.
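
One possible way to script those follow-up checks, sketched in Python: a scrub of the pool plus a long SMART self-test on each data disk. The pool name comes from the output above, but the disk list is an assumption (only ada2 is named in this thread), and `follow_up_commands` and `run_all` are hypothetical helpers. By default it only prints the commands instead of running them.

```python
import subprocess

POOL = "colossus"  # pool name from the zpool status output above
# ASSUMPTION: the four data disks are ada0..ada3; only ada2 is named in the thread.
DISKS = ["ada0", "ada1", "ada2", "ada3"]


def follow_up_commands(pool, disks):
    """Build the scrub command and one long self-test command per disk."""
    cmds = [["zpool", "scrub", pool]]
    cmds += [["smartctl", "-t", "long", f"/dev/{disk}"] for disk in disks]
    return cmds


def run_all(cmds, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run_all(follow_up_commands(POOL, DISKS))
```

Running a scrub and long self-tests at the same time will slow both down, so it may be gentler on the disks to start them one after the other. Results can be read back afterwards with `zpool status -v` and `smartctl -a /dev/adaX`.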
