New disks degraded

dias

Dabbler
Joined
Nov 3, 2022
Messages
24
Hi,
I am new to TrueNAS and have just installed SCALE. I have 2 pools.

System:
TrueNAS-SCALE-22.02.4 (Baremetal)

Pool 1:
2x 12 tb seagate exos mirror

Pool 2:
2x 8 tb wd red pro mirror

Main boot drive:
240 gb sandisk ssd

Board:
Msi z170a gaming m7

Ethernet Card:
Intel 82599EN 10 Gbit SFP+

Cpu:
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz

Ram:
1x 8 GB non-ECC RAM (brand unknown)

Case:
Thermaltake level 10 GTS

The case, the RAID drives, and the NIC are brand new; everything else is older hardware.

I am getting a degraded notification for one of my WD Red Pro drives (sdb). I ran both the short and the long SMART test on this drive, and both passed. Other than that, the pool seems to be working fine: I can transfer files to it over the LAN, but it is annoying to see that error on new disks. I tried swapping the SATA cables to see whether a cable was at fault, but that made the pool unavailable, so I changed the cables back. Is it normal to see degraded notifications on a new drive? How can I troubleshoot this?

Zpool status:

Code:

root@truenas[~]# zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors

  pool: seagate
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        seagate                                   ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            4ced2a17-6b88-4bec-a301-ae54cb41d627  ONLINE       0     0     0
            fbf0eef0-74af-4c71-9702-6371eaf3fbb1  ONLINE       0     0     0

errors: No known data errors

  pool: wd
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 712K in 00:00:19 with 0 errors on Wed Nov  2 23:47:38 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        wd                                        DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            b9e99684-cab8-4adf-add3-dacadae436ef  FAULTED      0     3     0  too many errors
            d16360d8-1281-4b39-bfc0-585bb9f5f37a  ONLINE       0     0     0

errors: No known data errors
root@truenas[~]# 





Here are the SMART results:



Code:
root@truenas[~]# smartctl -qnoserial -x /dev/sdb         
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red Pro
Device Model:     WDC WD8003FFBX-68B9AN0
Firmware Version: 83.00A83
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov  3 14:58:56 2022 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 889) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  --S---   132   132   054    -    96
  3 Spin_Up_Time            POS---   157   157   024    -    537 (Average 496)
  4 Start_Stop_Count        -O--C-   100   100   000    -    284
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         -O-R--   100   100   067    -    0
  8 Seek_Time_Performance   --S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   100   100   000    -    607
 10 Spin_Retry_Count        -O--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    26
192 Power-Off_Retract_Count -O--CK   100   100   000    -    317
193 Load_Cycle_Count        -O--C-   100   100   000    -    317
194 Temperature_Celsius     -O----   120   120   000    -    50 (Min/Max 21/53)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   196   196   000    -    29
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   5501  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x2f       GPL     -        1  Set Sector Configuration
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 29 (device log contains only the most recent 4 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 29 [0] occurred at disk power-on lifetime: 592 hours (24 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 43 00 00 00 00 00 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 80 00 00 00 40 02 90 40 08     00:01:13.651  WRITE FPDMA QUEUED
  61 00 10 00 78 00 03 a3 81 26 90 40 08     00:01:13.650  WRITE FPDMA QUEUED
  61 00 10 00 70 00 03 a3 81 28 90 40 08     00:01:13.650  WRITE FPDMA QUEUED
  47 00 00 00 01 00 00 00 00 00 12 a0 08     00:01:13.635  READ LOG DMA EXT
  47 00 00 00 01 00 00 00 00 00 00 a0 08     00:01:13.634  READ LOG DMA EXT

Error 28 [3] occurred at disk power-on lifetime: 592 hours (24 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 43 00 00 00 00 00 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 40 00 00 00 40 02 90 40 08     00:01:13.202  WRITE FPDMA QUEUED
  61 00 10 00 50 00 03 a3 81 28 90 40 08     00:01:13.202  WRITE FPDMA QUEUED
  61 00 10 00 48 00 03 a3 81 26 90 40 08     00:01:13.202  WRITE FPDMA QUEUED
  47 00 00 00 01 00 00 00 00 00 12 a0 08     00:01:13.187  READ LOG DMA EXT
  47 00 00 00 01 00 00 00 00 00 00 a0 08     00:01:13.186  READ LOG DMA EXT

Error 27 [2] occurred at disk power-on lifetime: 592 hours (24 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 43 00 00 00 00 00 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 70 00 00 00 40 02 90 40 08     00:01:12.779  WRITE FPDMA QUEUED
  61 00 10 00 68 00 03 a3 81 26 90 40 08     00:01:12.779  WRITE FPDMA QUEUED
  61 00 10 00 60 00 03 a3 81 28 90 40 08     00:01:12.779  WRITE FPDMA QUEUED
  47 00 00 00 01 00 00 00 00 00 12 a0 08     00:01:12.777  READ LOG DMA EXT
  47 00 00 00 01 00 00 00 00 00 00 a0 08     00:01:12.777  READ LOG DMA EXT

Error 26 [1] occurred at disk power-on lifetime: 592 hours (24 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 43 00 00 00 00 00 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 18 00 00 00 40 02 90 40 08     00:01:12.411  WRITE FPDMA QUEUED
  61 00 10 00 28 00 03 a3 81 28 90 40 08     00:01:12.411  WRITE FPDMA QUEUED
  61 00 10 00 20 00 03 a3 81 26 90 40 08     00:01:12.411  WRITE FPDMA QUEUED
  60 00 10 00 c0 00 03 a3 81 28 90 40 08     00:01:12.411  READ FPDMA QUEUED
  60 00 10 00 b8 00 03 a3 81 26 90 40 08     00:01:12.411  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       606         -
# 2  Short offline       Completed without error       00%       592         -
# 3  Short offline       Completed without error       00%       560         -
# 4  Short offline       Completed without error       00%         7         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    50 Celsius
Power Cycle Min/Max Temperature:     46/53 Celsius
Lifetime    Min/Max Temperature:     21/53 Celsius
Under/Over Temperature Limit Count:   0/0

Write SCT Data Table failed: scsi error badly formed scsi parameters
Read SCT Temperature History failed

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              26  ---  Lifetime Power-On Resets
0x01  0x010  4             607  ---  Power-on Hours
0x01  0x018  6     23424995762  ---  Logical Sectors Written
0x01  0x020  6       132433109  ---  Number of Write Commands
0x01  0x028  6      7144645036  ---  Logical Sectors Read
0x01  0x030  6        23174675  ---  Number of Read Commands
0x01  0x038  6      2185910450  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4             488  ---  Spindle Motor Power-on Hours
0x03  0x010  4             488  ---  Head Flying Hours
0x03  0x018  4             317  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              29  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              50  ---  Current Temperature
0x05  0x010  1              51  N--  Average Short Term Temperature
0x05  0x018  1              45  N--  Average Long Term Temperature
0x05  0x020  1              53  ---  Highest Temperature
0x05  0x028  1              21  ---  Lowest Temperature
0x05  0x030  1              51  N--  Highest Average Short Term Temperature
0x05  0x038  1              25  N--  Lowest Average Short Term Temperature
0x05  0x040  1              45  N--  Highest Average Long Term Temperature
0x05  0x048  1              25  N--  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             168  ---  Number of Hardware Resets
0x06  0x010  4              33  ---  Number of ASR Events
0x06  0x018  4              30  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            5  Command failed due to ICRC error
0x0002  2            5  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            5  R_ERR response for host-to-device data FIS
0x0005  2            4  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            4  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            8  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            8  Device-to-host register FISes sent due to a COMRESET
0x000b  2            5  CRC errors within host-to-device FIS
0x000d  2            4  Non-CRC errors within host-to-device FIS
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
You have SATA interface CRC errors on that disk. Probably a bad cable and/or a bad connection. Reseat the cable on both ends. If that doesn't work, replace it, and keep in mind that SATA is limited to 1 meter on paper. In reality, even at 0.5 m things may not work as well as they should, so try to keep cable lengths to a minimum.
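
One way to tell whether reseating or replacing the cable helped is to watch the drive's interface CRC counter (attribute 199 in the SMART output above). The raw value is cumulative, so what matters is whether it keeps climbing after the change. A minimal sketch, assuming the brief attribute layout shown earlier in the thread; `crc_count` is a hypothetical helper name, and `/dev/sdb` is simply the device from this thread.

```shell
# crc_count: read smartctl attribute text on stdin and print the raw value
# of UDMA_CRC_Error_Count (attribute 199). Hypothetical helper, for illustration.
crc_count() {
  awk '$2 == "UDMA_CRC_Error_Count" { print $NF }'
}

# Usage on the live system (run as root), before and after touching the cable:
#   smartctl -A /dev/sdb | crc_count
#   smartctl -l sataphy /dev/sdb    # SATA Phy event counters, as in the -x output
```

If the number printed a few days after the cable swap matches the earlier one (29 in the output above), the link is no longer producing new CRC errors.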
 

dias

Dabbler
Joined
Nov 3, 2022
Messages
24
Hi,
Thanks for the quick reply. The cables are 0.5 m long as far as I know; I will buy another SATA cable and try with that. Should I be worried about the power cables? All four HDDs get power from the same cable because of the case's hard disk swap design. I don't think it is a power cable issue, since only one HDD, in the same slot, showed errors, but it's always best to ask :)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
All four HDDs get power from the same cable because of the case's hard disk swap design. I don't think it is a power cable issue, since only one HDD, in the same slot, showed errors, but it's always best to ask :)

If they're getting power through an AMP Mate-N-Lok connector (typically referred to as "Molex" power), the pin ratings are more than 10 A (I think 11 A actually), so that should run the 12 V for four drives just fine. Near the limit though.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
A lot of connectors are actually rated well below the original spec, so that may still be a concern. That said, it's unlikely to be related to the CRC errors.
 

dias

Dabbler
Joined
Nov 3, 2022
Messages
24
Thank you all for the help. The pool has been fine for 2-3 days since I changed the SATA cable.
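
For anyone following the same steps: once the cable is replaced, the `zpool status` output above suggests `zpool clear` to mark the device repaired, and a scrub then re-verifies the data on both mirror members. The following is only a sketch under those assumptions; `pool_state` is a hypothetical helper, and `wd` is the pool name from this thread.

```shell
# pool_state: read `zpool status <pool>` text on stdin and print the value of
# its "state:" line (for example ONLINE or DEGRADED). Hypothetical helper.
pool_state() {
  awk '$1 == "state:" { print $2; exit }'
}

# Usage (run as root):
#   zpool clear wd                 # reset the error counters, per the "action:" hint
#   zpool scrub wd                 # re-read and verify all data in the pool
#   zpool status wd | pool_state   # should print ONLINE once the mirror is healthy
```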
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

In the future, it would be a good idea to do your system burn-in testing prior to loading TrueNAS. It is more difficult to debug problems on a running system, also more dangerous.

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
194 Temperature_Celsius -O---- 120 120 000 - 50 (Min/Max 21/53)

That's a little warm for my tastes, although I do know the Red Pro disks run warmer than what I'm expecting. Keep an eye on this under load (scrub or burn-in testing) and make sure it doesn't go up much from here.
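
A rough way to keep an eye on this during a scrub or a long test is to poll attribute 194 and compare it with the drive's own limit (the Device Statistics section above reports a specified maximum operating temperature of 60 °C). This is a minimal sketch, assuming the brief attribute format used in the output above, where the raw value starts in the eighth column; `drive_temp` is a hypothetical helper name.

```shell
# drive_temp: read brief-format smartctl attributes on stdin and print the
# current temperature in Celsius (attribute 194, first token of the raw value).
# Hypothetical helper, for illustration.
drive_temp() {
  awk '$2 == "Temperature_Celsius" { print $8; exit }'
}

# Usage while the disk is under load (run as root; /dev/sdb is the disk
# from this thread), printing one reading per minute:
#   while sleep 60; do smartctl -A -f brief /dev/sdb | drive_temp; done
```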
 

dias

Dabbler
Joined
Nov 3, 2022
Messages
24
In the future, it would be a good idea to do your system burn-in testing prior to loading TrueNAS. It is more difficult to debug problems on a running system, also more dangerous.

Is there a specific burn-in test you can recommend? I ran a long SMART test after the first error and it passed.
 

dias

Dabbler
Joined
Nov 3, 2022
Messages
24
That's a little warm for my tastes, although I do know the Red Pro disks run warmer than what I'm expecting. Keep an eye on this under load (scrub or burn-in testing) and make sure it doesn't go up much from here.
It was like that right after the long SMART test, I guess; normally it is lower. But yes, both of them run hotter than the Seagate Exos drives.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Is there a specific burn-in test you can recommend? I ran a long SMART test after the first error and it passed.

Yes, I linked to the burn-in test sticky. You do all of it.

 