scrub repaired 0 (zero) in ... with 0 (zero) errors.


panz

Guru
Joined
May 24, 2013
Messages
556
Hi,

I just received this log by email:

Code:
pool: alderaan
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 1h40m with 0 errors on Fri Oct 23 01:35:26 2015
config:

        NAME                                                STATE     READ WRITE CKSUM
        alderaan                                            ONLINE       0     0     0
          raidz2-0                                          ONLINE       0     0     0
            gptid/85faa6bf-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
            gptid/8656fbab-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     1     0
            gptid/86b39a94-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
            gptid/87110d54-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
            gptid/876ce98a-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
            gptid/87cbcd94-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0

errors: No known data errors


So, scrub repaired "zero" with zero errors. What does this mean? I then ran a smartctl long test on all the disks; none of them reported a problem.
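A rough aid for reading the status output above: the scan line describes what the scrub itself found (nothing repaired, no errors), while the READ/WRITE/CKSUM columns are per-device counters of failed I/O. The single WRITE error on one raidz2 member is most likely what raised the status message. A minimal Python sketch that flags such rows, assuming the config table is pasted in as text (the regex and the function name are illustrative, not part of any ZFS tool):

```python
import re

# Rows copied from the `zpool status` config table above.
ZPOOL_CONFIG = """\
NAME                                                STATE     READ WRITE CKSUM
alderaan                                            ONLINE       0     0     0
  raidz2-0                                          ONLINE       0     0     0
    gptid/85faa6bf-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
    gptid/8656fbab-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     1     0
    gptid/86b39a94-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
    gptid/87110d54-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
    gptid/876ce98a-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
    gptid/87cbcd94-0c4c-11e4-9238-0cc47a003bd9.eli  ONLINE       0     0     0
"""

# name, state, then three numeric counters; the header row does not match.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

def devices_with_errors(config_text):
    """Return (name, read, write, cksum) for every row with a nonzero counter."""
    flagged = []
    for line in config_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        name, _state, rd, wr, ck = m.groups()
        counts = (int(rd), int(wr), int(ck))
        if any(counts):
            flagged.append((name, *counts))
    return flagged

for name, rd, wr, ck in devices_with_errors(ZPOOL_CONFIG):
    print(f"{name}: read={rd} write={wr} cksum={ck}")
```

Only the `gptid/8656fbab-...` member is reported, with one write error.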

Another log (same date and time) says:

Code:
reenas.WORKGROUP kernel log messages:
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f1 f8 00 00 40 00 length 32768 SMID 172 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f1 98 00 00 40 00 length 32768 SMID 587 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f1 58 00 00 40 00 length 32768 SMID 137 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f2 38 00 00 40 00 length 32768 SMID 287 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f6 f0 00 00 40 00 length 32768 SMID 944 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f7 30 00 00 40 00 length 32768 SMID 397 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f7 70 00 00 40 00 length 32768 SMID 372 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f7 b0 00 00 40 00 length 32768 SMID 437 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f7 f0 00 00 40 00 length 32768 SMID 462 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f8 30 00 00 40 00 length 32768 SMID 143 terminated ioc 804b scsi 0 state 0 xfer 0
> (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 6b 33 f5 38 00 00 08 00
> (da1:mps0:0:9:0): CAM status: SCSI Status Error
> (da1:mps0:0:9:0): SCSI status: Check Condition
> (da1:mps0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
> (da1:mps0:0:9:0): Info: 0x6b33f538
> (da1:mps0:0:9:0): Error 22, Unretryable error
> GEOM_ELI: Crypto WRITE request failed (error=22). gptid/8656fbab-0c4c-11e4-9238-0cc47a003bd9.eli[WRITE(offset=918718869504, length=4096)]


I have no idea...
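For what it is worth, the numbers in the kernel log can be cross-checked against each other. The sketch below decodes the failing WRITE(10) CDB (standard layout: opcode, flags, 4-byte big-endian LBA, group, 2-byte transfer length, control) and compares it with the byte offset GEOM_ELI reported. It assumes 512-byte logical sectors; the helper name is made up for illustration.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks

# The CDB of the command that returned Check Condition in the log above.
lba, blocks = decode_write10("2a 00 6b 33 f5 38 00 00 08 00")

geli_offset = 918718869504       # byte offset from the GEOM_ELI message
disk_offset = lba * SECTOR_SIZE  # byte offset of the same write on the raw disk

print(hex(lba), lba, blocks * SECTOR_SIZE)
print((disk_offset - geli_offset) // SECTOR_SIZE)
```

The decoded LBA is 0x6b33f538 (the value in the `Info:` line) and the length is 4096 bytes, matching the GEOM_ELI message. The difference between the two offsets is 4194432 sectors, which would be consistent with the encrypted partition starting at that sector, although the partition table is not shown in this thread. In other words, the ZFS write error, the GEOM_ELI failure and the SCSI error all appear to describe the same 4 KiB write.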
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You had a write fail. My guess is that the SMART data for da1 will show that something is not right. If not, you could be experiencing a whole range of other problems, including the famed "silent corruption".
 

panz

Guru
Joined
May 24, 2013
Messages
556
Now two questions:

1) If the scrub repaired zero of whatever it counts, with zero errors, do I have to worry?

2) Besides the smartctl short, long and conveyance tests - and keeping in mind that I stressed all the hard drives with read/write tests before putting them in production - is there any other test I could run?

«Silent corruption» sounds a bit frightening to me, like a defective cable or bad power...
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Well, let's see this, panz:

smartctl -x /dev/da1


What cyberjock was saying is that he is pretty sure that, once we look at the smartctl output for this drive, we will see some obvious issues.

But, you're in SAS/HBA land, which is a bit beyond what I can speak authoritatively on, so I'll just sit back and see how this unfolds. My guess, however, is that you have a failing drive, and the smartctl output will show something to confirm that.
 

panz

Guru
Joined
May 24, 2013
Messages
556
Here it is:
Code:
 smartctl -x /dev/da1
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p15 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    
LU WWN Device Id: 5 0014ee 6599246fc
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 23 15:11:55 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 397) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   174   173   021    -    6266
  4 Start_Stop_Count        -O--CK   100   100   000    -    150
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   088   088   000    -    8986
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    150
192 Power-Off_Retract_Count -O--CK   200   200   000    -    149
193 Load_Cycle_Count        -O--CK   200   200   000    -    1948
194 Temperature_Celsius     -O---K   124   111   000    -    26
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 8953 hours (373 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 6b 33 f5 38 00 00  Error: IDNF at LBA = 0x6b33f538 = 1798567224

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 00 6b 33 f5 38 40 00  1d+04:39:09.787  WRITE FPDMA QUEUED
  b0 00 d5 00 01 00 00 00 c2 4f 01 00 00  1d+04:29:50.071  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f 06 00 00  1d+04:29:50.071  SMART READ LOG
  b0 00 d0 00 01 00 00 00 c2 4f 00 00 00  1d+04:29:50.068  SMART READ DATA
  b0 00 da 00 00 00 00 00 c2 4f 00 00 00  1d+04:29:50.066  SMART RETURN STATUS

Error 2 [1] occurred at disk power-on lifetime: 4309 hours (179 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 26 81 56 88 40 00  Error: UNC at LBA = 0x26815688 = 646010504

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 10 00 00 26 81 36 00 40 00 30d+09:10:54.736  READ FPDMA QUEUED
  60 00 c0 00 00 00 00 26 81 35 00 40 00 30d+09:10:54.736  READ FPDMA QUEUED
  60 01 00 00 10 00 00 26 81 34 00 40 00 30d+09:10:54.735  READ FPDMA QUEUED
  60 01 00 00 00 00 00 26 81 33 00 40 00 30d+09:10:54.735  READ FPDMA QUEUED
  60 01 00 00 10 00 00 26 81 32 00 40 00 30d+09:10:54.735  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 3094 hours (128 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 35 97 6c 08 00 00  Error: UNC at LBA = 0x35976c08 = 899116040

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 08 00 00 45 a8 b7 b0 40 00     09:16:44.577  READ FPDMA QUEUED
  60 00 08 00 00 00 00 35 97 6c 08 40 00     09:16:44.577  READ FPDMA QUEUED
  b0 00 d5 00 01 00 00 00 c2 4f 01 00 00     09:01:31.153  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f 06 00 00     09:01:31.153  SMART READ LOG
  b0 00 d0 00 01 00 00 00 c2 4f 00 00 00     09:01:31.150  SMART READ DATA

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8977         -
# 2  Short offline       Completed without error       00%      8970         -
# 3  Extended offline    Completed without error       00%      5135         -
# 4  Short offline       Completed without error       00%      4617         -
# 5  Extended offline    Completed without error       00%      3431         -
# 6  Short offline       Completed without error       00%      3424         -
# 7  Extended offline    Completed without error       00%      3137         -
# 8  Extended offline    Completed without error       00%      2116         -
# 9  Short offline       Completed without error       00%      2047         -
#10  Short offline       Completed without error       00%      1835         -
#11  Extended offline    Completed without error       00%      1664         -
#12  Short offline       Completed without error       00%      1614         -
#13  Short offline       Completed without error       00%      1542         -
#14  Short offline       Completed without error       00%      1447         -
#15  Short offline       Completed without error       00%      1414         -
#16  Short offline       Completed without error       00%      1371         -
#17  Short offline       Completed without error       00%      1306         -
#18  Short offline       Completed without error       00%      1229         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    26 Celsius
Power Cycle Min/Max Temperature:     23/26 Celsius
Lifetime    Min/Max Temperature:     18/39 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (265)

Index    Estimated Time   Temperature Celsius
266    2015-10-23 07:14    33  **************
...    ..(  5 skipped).    ..  **************
272    2015-10-23 07:20    33  **************
273    2015-10-23 07:21    32  *************
...    ..( 29 skipped).    ..  *************
303    2015-10-23 07:51    32  *************
304    2015-10-23 07:52    31  ************
...    ..( 18 skipped).    ..  ************
323    2015-10-23 08:11    31  ************
324    2015-10-23 08:12    30  ***********
...    ..(  9 skipped).    ..  ***********
334    2015-10-23 08:22    30  ***********
335    2015-10-23 08:23    29  **********
...    ..(317 skipped).    ..  **********
175    2015-10-23 13:41    29  **********
176    2015-10-23 13:42     ?  -
177    2015-10-23 13:43    23  ****
178    2015-10-23 13:44    23  ****
179    2015-10-23 13:45    23  ****
180    2015-10-23 13:46    24  *****
...    ..(  2 skipped).    ..  *****
183    2015-10-23 13:49    24  *****
184    2015-10-23 13:50    25  ******
185    2015-10-23 13:51    25  ******
186    2015-10-23 13:52    26  *******
187    2015-10-23 13:53    26  *******
188    2015-10-23 13:54    31  ************
...    ..(  7 skipped).    ..  ************
196    2015-10-23 14:02    31  ************
197    2015-10-23 14:03    32  *************
...    ..( 21 skipped).    ..  *************
219    2015-10-23 14:25    32  *************
220    2015-10-23 14:26    33  **************
...    ..( 44 skipped).    ..  **************
265    2015-10-23 15:11    33  **************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4          617  Vendor specific




Thank you :)
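To put the error log in the output above on a timeline, the power-on hour stamps of the three logged errors can be compared with the current Power_On_Hours value (8986). A small Python sketch, with the relevant log lines pasted in as text; the regexes and the `summarize` helper are illustrative assumptions, not part of smartmontools:

```python
import re

POWER_ON_HOURS = 8986  # raw value of attribute 9 in the output above

# Header and register lines of the three entries in the extended error log.
ERROR_LOG = """\
Error 3 [2] occurred at disk power-on lifetime: 8953 hours (373 days + 1 hours)
  10 -- 51 00 00 00 00 6b 33 f5 38 00 00  Error: IDNF at LBA = 0x6b33f538 = 1798567224
Error 2 [1] occurred at disk power-on lifetime: 4309 hours (179 days + 13 hours)
  40 -- 51 00 00 00 00 26 81 56 88 40 00  Error: UNC at LBA = 0x26815688 = 646010504
Error 1 [0] occurred at disk power-on lifetime: 3094 hours (128 days + 22 hours)
  40 -- 51 00 00 00 00 35 97 6c 08 00 00  Error: UNC at LBA = 0x35976c08 = 899116040
"""

HEADER = re.compile(r"^Error (\d+) \[\d+\] occurred at disk power-on lifetime: (\d+) hours")
DETAIL = re.compile(r"Error: (\w+) at LBA = 0x[0-9a-f]+ = (\d+)")

def summarize(log_text, now_hours):
    """Return (error number, type, LBA, power-on hours since the error)."""
    events = []
    current = None
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = (int(header.group(1)), int(header.group(2)))
            continue
        detail = DETAIL.search(line)
        if detail and current:
            num, hours = current
            events.append((num, detail.group(1), int(detail.group(2)), now_hours - hours))
            current = None
    return events

for num, kind, lba, age in summarize(ERROR_LOG, POWER_ON_HOURS):
    print(f"error {num}: {kind} at LBA {lba}, {age} power-on hours ago")
```

Error 3, the IDNF (ID not found) on a write to LBA 1798567224, comes out 33 power-on hours before this report, and it is the same LBA as in the kernel log. The two UNC (uncorrectable) read errors are thousands of hours older. The extended self-test logged at 8977 hours ran after error 3 and completed without error.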
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Here it is:
Code:
 smartctl -x /dev/da1
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p15 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:  
LU WWN Device Id: 5 0014ee 6599246fc
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 23 15:11:55 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 397) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   174   173   021    -    6266
  4 Start_Stop_Count        -O--CK   100   100   000    -    150
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   088   088   000    -    8986
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    150
192 Power-Off_Retract_Count -O--CK   200   200   000    -    149
193 Load_Cycle_Count        -O--CK   200   200   000    -    1948
194 Temperature_Celsius     -O---K   124   111   000    -    26
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 8953 hours (373 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 6b 33 f5 38 00 00  Error: IDNF at LBA = 0x6b33f538 = 1798567224

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 00 6b 33 f5 38 40 00  1d+04:39:09.787  WRITE FPDMA QUEUED
  b0 00 d5 00 01 00 00 00 c2 4f 01 00 00  1d+04:29:50.071  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f 06 00 00  1d+04:29:50.071  SMART READ LOG
  b0 00 d0 00 01 00 00 00 c2 4f 00 00 00  1d+04:29:50.068  SMART READ DATA
  b0 00 da 00 00 00 00 00 c2 4f 00 00 00  1d+04:29:50.066  SMART RETURN STATUS

Error 2 [1] occurred at disk power-on lifetime: 4309 hours (179 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 26 81 56 88 40 00  Error: UNC at LBA = 0x26815688 = 646010504

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 10 00 00 26 81 36 00 40 00 30d+09:10:54.736  READ FPDMA QUEUED
  60 00 c0 00 00 00 00 26 81 35 00 40 00 30d+09:10:54.736  READ FPDMA QUEUED
  60 01 00 00 10 00 00 26 81 34 00 40 00 30d+09:10:54.735  READ FPDMA QUEUED
  60 01 00 00 00 00 00 26 81 33 00 40 00 30d+09:10:54.735  READ FPDMA QUEUED
  60 01 00 00 10 00 00 26 81 32 00 40 00 30d+09:10:54.735  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 3094 hours (128 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 35 97 6c 08 00 00  Error: UNC at LBA = 0x35976c08 = 899116040

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 08 00 00 45 a8 b7 b0 40 00     09:16:44.577  READ FPDMA QUEUED
  60 00 08 00 00 00 00 35 97 6c 08 40 00     09:16:44.577  READ FPDMA QUEUED
  b0 00 d5 00 01 00 00 00 c2 4f 01 00 00     09:01:31.153  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f 06 00 00     09:01:31.153  SMART READ LOG
  b0 00 d0 00 01 00 00 00 c2 4f 00 00 00     09:01:31.150  SMART READ DATA

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8977         -
# 2  Short offline       Completed without error       00%      8970         -
# 3  Extended offline    Completed without error       00%      5135         -
# 4  Short offline       Completed without error       00%      4617         -
# 5  Extended offline    Completed without error       00%      3431         -
# 6  Short offline       Completed without error       00%      3424         -
# 7  Extended offline    Completed without error       00%      3137         -
# 8  Extended offline    Completed without error       00%      2116         -
# 9  Short offline       Completed without error       00%      2047         -
#10  Short offline       Completed without error       00%      1835         -
#11  Extended offline    Completed without error       00%      1664         -
#12  Short offline       Completed without error       00%      1614         -
#13  Short offline       Completed without error       00%      1542         -
#14  Short offline       Completed without error       00%      1447         -
#15  Short offline       Completed without error       00%      1414         -
#16  Short offline       Completed without error       00%      1371         -
#17  Short offline       Completed without error       00%      1306         -
#18  Short offline       Completed without error       00%      1229         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    26 Celsius
Power Cycle Min/Max Temperature:     23/26 Celsius
Lifetime    Min/Max Temperature:     18/39 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (265)

Index    Estimated Time   Temperature Celsius
266    2015-10-23 07:14    33  **************
...    ..(  5 skipped).    ..  **************
272    2015-10-23 07:20    33  **************
273    2015-10-23 07:21    32  *************
...    ..( 29 skipped).    ..  *************
303    2015-10-23 07:51    32  *************
304    2015-10-23 07:52    31  ************
...    ..( 18 skipped).    ..  ************
323    2015-10-23 08:11    31  ************
324    2015-10-23 08:12    30  ***********
...    ..(  9 skipped).    ..  ***********
334    2015-10-23 08:22    30  ***********
335    2015-10-23 08:23    29  **********
...    ..(317 skipped).    ..  **********
175    2015-10-23 13:41    29  **********
176    2015-10-23 13:42     ?  -
177    2015-10-23 13:43    23  ****
178    2015-10-23 13:44    23  ****
179    2015-10-23 13:45    23  ****
180    2015-10-23 13:46    24  *****
...    ..(  2 skipped).    ..  *****
183    2015-10-23 13:49    24  *****
184    2015-10-23 13:50    25  ******
185    2015-10-23 13:51    25  ******
186    2015-10-23 13:52    26  *******
187    2015-10-23 13:53    26  *******
188    2015-10-23 13:54    31  ************
...    ..(  7 skipped).    ..  ************
196    2015-10-23 14:02    31  ************
197    2015-10-23 14:03    32  *************
...    ..( 21 skipped).    ..  *************
219    2015-10-23 14:25    32  *************
220    2015-10-23 14:26    33  **************
...    ..( 44 skipped).    ..  **************
265    2015-10-23 15:11    33  **************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4          617  Vendor specific




Thank you :)
Right, so, you have quite a few errors in here. The SMART attributes look fine. However, there are quite a few FPDMA read errors in there, and a couple of other things that suggest some interesting screwups happening.

More than likely, this is bad. You almost certainly have a problem with either the drive controller circuitry, or the HBA serving the drive (if any). There is a small possibility in a case like this that the problem is the data cable to the drive.

Whenever this stuff starts happening, then eventually there are problems with data corruption and writing to the drive (you've already had one write fail according to your zpool status).

I would rule out a bad data cable, and then replace the drive. This kind of stuff is almost always hardware related with the drive.
 

panz

Guru
Joined
May 24, 2013
Messages
556
Another drive (da0) has some SMART read/write LOG errors:

Code:
[root@freenas ~]# smartctl -x /dev/da0 | less
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p15 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4N2004523
LU WWN Device Id: 5 0014ee 659924f74
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 23 17:09:00 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (41040) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 412) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   179   177   021    -    6025
  4 Start_Stop_Count        -O--CK   100   100   000    -    150
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   088   088   000    -    8988
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    150
192 Power-Off_Retract_Count -O--CK   200   200   000    -    149
193 Load_Cycle_Count        -O--CK   200   200   000    -    2001
194 Temperature_Celsius     -O---K   122   111   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 25 (device log contains only the most recent 24 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 25 [0] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.186  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.186  SMART READ LOG
  2f 00 00 00 01 00 00 00 00 00 11 40 00     23:45:29.186  READ LOG EXT
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.186  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART READ LOG

Error 24 [23] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.186  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.185  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.185  SMART READ LOG

Error 23 [22] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.185  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.185  SMART READ LOG
  b0 00 d6 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART WRITE LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART READ LOG

Error 22 [21] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.185  SMART READ LOG
  b0 00 d6 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART WRITE LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.185  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.184  SMART READ LOG
  b0 00 d6 00 01 00 00 00 c2 4f e0 00 00     23:45:29.184  SMART WRITE LOG

Error 21 [20] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.184  SMART READ LOG
  b0 00 d6 00 01 00 00 00 c2 4f e0 00 00     23:45:29.184  SMART WRITE LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.184  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.184  SMART READ LOG
  b0 00 d6 00 01 00 00 00 c2 4f e0 00 00     23:45:29.183  SMART WRITE LOG

Error 20 [19] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.184  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.184  SMART READ LOG
  b0 00 d6 00 01 00 00 00 c2 4f e0 00 00     23:45:29.183  SMART WRITE LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.183  SMART READ LOG
  b0 00 d6 00 01 00 00 00 c2 4f e0 00 00     23:45:29.183  SMART WRITE LOG

Error 19 [18] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.180  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.180  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.179  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.179  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.179  SMART READ LOG

Error 18 [17] occurred at disk power-on lifetime: 2129 hours (88 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  04 -- 51 00 01 00 00 00 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.178  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e0 00 00     23:45:29.178  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f 09 00 00     23:45:29.178  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f e1 00 00     23:45:29.178  SMART READ LOG
  b0 00 d5 00 01 00 00 00 c2 4f 09 00 00     23:45:29.178  SMART READ LOG

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8978         -
# 2  Short offline       Completed without error       00%      8970         -
# 3  Short offline       Completed without error       00%      8954         -
# 4  Extended offline    Completed without error       00%      8943         -
# 5  Short offline       Completed without error       00%      8914         -
# 6  Extended offline    Completed without error       00%      8903         -
# 7  Short offline       Completed without error       00%      8794         -
# 8  Short offline       Completed without error       00%      8674         -
# 9  Extended offline    Completed without error       00%      8664         -
#10  Short offline       Completed without error       00%      8621         -
#11  Short offline       Completed without error       00%      8533         -
#12  Extended offline    Completed without error       00%      8523         -
#13  Short offline       Completed without error       00%      8388         -
#14  Short offline       Completed without error       00%      8297         -
#15  Extended offline    Completed without error       00%      8286         -
#16  Short offline       Completed without error       00%      8177         -
#17  Short offline       Completed without error       00%      8057         -
#18  Extended offline    Completed without error       00%      8046         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     22/28 Celsius
Lifetime    Min/Max Temperature:     18/39 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (410)

Index    Estimated Time   Temperature Celsius
411    2015-10-23 09:12    28  *********
...    ..(268 skipped).    ..  *********
202    2015-10-23 13:41    28  *********
203    2015-10-23 13:42    29  **********
204    2015-10-23 13:43     ?  -
205    2015-10-23 13:44    22  ***
206    2015-10-23 13:45    22  ***
207    2015-10-23 13:46    23  ****
208    2015-10-23 13:47    23  ****
209    2015-10-23 13:48    24  *****
...    ..(  2 skipped).    ..  *****
212    2015-10-23 13:51    24  *****
213    2015-10-23 13:52    25  ******
214    2015-10-23 13:53    25  ******
215    2015-10-23 13:54    25  ******
216    2015-10-23 13:55    26  *******
...    ..(  7 skipped).    ..  *******
224    2015-10-23 14:03    26  *******
225    2015-10-23 14:04    27  ********
...    ..( 24 skipped).    ..  ********
250    2015-10-23 14:29    27  ********
251    2015-10-23 14:30    28  *********
...    ..( 80 skipped).    ..  *********
332    2015-10-23 15:51    28  *********
333    2015-10-23 15:52    30  ***********
...    ..( 10 skipped).    ..  ***********
344    2015-10-23 16:03    30  ***********
345    2015-10-23 16:04    29  **********
...    ..( 46 skipped).    ..  **********
392    2015-10-23 16:51    29  **********
393    2015-10-23 16:52    28  *********
...    ..( 16 skipped).    ..  *********
410    2015-10-23 17:09    28  *********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4         7636  Vendor specific


This is a pool of 6 drives. The others seem fine.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Code:
> (da1:mps0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
> (da1:mps0:0:9:0): Info: 0x6b33f538
> (da1:mps0:0:9:0): Error 22, Unretryable error
> GEOM_ELI: Crypto WRITE request failed (error=22). gptid/8656fbab-0c4c-11e4-9238-0cc47a003bd9.eli[WRITE(offset=918718869504, length=4096)]


I have no idea...

Well, that write offset is crazy way off for a 3TB drive. It's trying to write out a block at about the 3.7TB mark.
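For anyone who wants to check that figure, here is a rough sketch of the arithmetic. It assumes the 512-byte logical sectors that smartctl reports for these drives and counts 1 TB as 10^12 bytes; the variable names are only for illustration, and the LBA, offset and capacity are copied from the logs in this thread. Under those assumptions both the failing LBA and the GEOM_ELI offset land near the 0.92 TB mark, inside the drive, so a figure around 3.7TB would only follow if the offset were counted in larger units.

```python
# Assumption: 512-byte logical sectors, as smartctl reports for these drives.
SECTOR_BYTES = 512
DRIVE_BYTES = 3_000_592_982_016      # "User Capacity" from smartctl -x

failed_lba = 0x6b33f538              # "Info:" field from the kernel log
eli_offset = 918_718_869_504         # byte offset reported by GEOM_ELI

lba_bytes = failed_lba * SECTOR_BYTES


def tb(n):
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return n / 1e12


print(f"LBA byte offset : {lba_bytes} ({tb(lba_bytes):.2f} TB)")
print(f"GEOM_ELI offset : {eli_offset} ({tb(eli_offset):.2f} TB)")
print(f"difference      : {lba_bytes - eli_offset} bytes")
print(f"inside drive    : {lba_bytes < DRIVE_BYTES}")
```

The two offsets differ by a little over 2 GiB. That would be consistent with the encrypted data partition starting after a small swap partition, but the partition layout is a guess here and is not shown in the logs.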

Right, so, you have quite a few errors in here. The SMART attributes look fine. However, there are quite a few FPDMA read errors in there, and a couple of other things that suggest some interesting screwups happening.

Where do you see that? I'm only seeing three errors, and three errors each separated by months is ... not awesome but not horrifying.

da0 has some problems. I'm thinking either ZFS freaked at the da1 write error or at the steady stream of ABRT's on da0. If the pool scrubbed fine and the extended SMART tests are not showing any issues, I think your data is fine but that some inspection of da0 is in order.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Now two questions:

1) If scrub repaired zero whatever with zero errors, do I have to worry?

2) Besides smartctl short, long and conveyance tests - and keeping in mind that I stressed all the hard drives with read/write tests before putting them in production - is there any other test I could run?

«Silent corruption» sounds a bit frightening to me, like a defective cable or bad power...

1. No errors is ideal.
2. That's about all you can realistically do. Other than that, just monitor it for the future.

Silent corruption is something you've no doubt run into over the years, but were unable to prove until ZFS came about. So while it's scary, I have no doubt you've seen it before with some corrupt file somewhere and didn't have a way to quantify or even prove the cause. It's really nothing to worry about unless it's happening regularly or in large quantities. At that point you should definitely start troubleshooting the storage subsystem to determine the cause.
 

panz

Guru
Joined
May 24, 2013
Messages
556
So,

1) is ZFS keeping my data safe?

2) Do I have to replace da0 and da1?

@ jgreco: the "crazy" write offset: how could this happen? Is the drive's firmware buggy?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
1) Yes

2) I think da0's prolly fine. Keep an eye on da1, maybe dig deeper.

Unclear on what caused the crazy write offset. I don't delve that deeply into the software these days.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Concur with jgreco.
 

panz

Guru
Joined
May 24, 2013
Messages
556
Adding just another piece to the puzzle: all the log messages were generated at 3:00 a.m., when FreeNAS does its control routines.
 

TXAG26

Patron
Joined
Sep 20, 2013
Messages
310
I'd try replacing the SATA cable to that drive with a new one, as well as the power cable. I had a round of bad luck with a couple of break-out cables giving up the ghost. It was the last thing I would have thought to cause random errors, but I was kicking myself for not swapping them out first. I'd say the likelihood of an LSI HBA randomly going bad would be pretty small in my book. Faulty cables and PSU/power connectors are more likely and easier to track down and fix. Good luck!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I honestly don't think I'd touch da0. All those ABRT's were clustered together and many months ago. It doesn't look like there's anything wrong.
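To put a rough number on "many months ago", here is a small sketch using the two figures from the da0 smartctl output: the power-on lifetime stamped on the ABRT errors and the current Power_On_Hours raw value. It assumes the drive has been powered on more or less continuously, so power-on hours roughly track wall-clock time; the variable names are illustrative only.

```python
# Values copied from the da0 smartctl -x output in this thread.
power_on_hours_now = 8988    # attribute 9, Power_On_Hours (raw value)
abrt_error_hours = 2129      # "occurred at disk power-on lifetime: 2129 hours"

hours_since = power_on_hours_now - abrt_error_hours
days_since = hours_since / 24

# Assumption: the drive runs close to 24/7, so power-on hours
# approximate elapsed calendar time.
print(f"{hours_since} power-on hours since the ABRT burst (~{days_since:.0f} days)")
```

That works out to somewhere around nine months of run time with no new entries in the error log, and all of the logged ABRTs carry timestamps within the same second.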
 

panz

Guru
Joined
May 24, 2013
Messages
556
I'd try replacing the SATA cable to that drive with a new one [cut]

HDs are connected directly to the (their) backplane. Every backplane hosts 4 drives and a single SFF-8087 cable goes directly from the backplane to the Intel RES2SV240 SAS/SATA Expander card.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think I'd leave it alone. It doesn't seem like there's much going on that would raise true concern except the one bizarre write error. The error correction capabilities in ZFS were actually designed to deal with exactly that sort of case, the rare error that occurs for no obvious reason. You're more likely to cause trouble if you start tinkering with what appears to be a stable system that threw an odd error.
 

panz

Guru
Joined
May 24, 2013
Messages
556
Update 2016-01-04:

Code:
Checking status of zfs pools:
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
albedo     928G  11.5G   916G     1%  1.00x  ONLINE  /mnt
alderaan  16.2T  3.00T  13.3T    18%  1.00x  ONLINE  /mnt

  pool: alderaan
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 276K in 0h0m with 0 errors on Thu Dec 31 20:01:46 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        alderaan                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/71b2b4ca-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     0     0
            gptid/72107f31-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     1     0
            gptid/72708fc1-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     0     0
            gptid/72cea8d1-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     0     0
            gptid/732c7c18-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     0     0
            gptid/738ad70d-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     0     0

errors: No known data errors


So, gptid/72107f31-7e90-11e5-b81e-0cc47a003bd9 ONLINE 0 1 0

reports a "1" in the WRITE column instead of all zeros like the other disks.
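As a minimal sketch of how to spot that kind of row automatically, the snippet below scans the config table of a `zpool status` dump and lists every entry whose READ, WRITE or CKSUM counter is nonzero. The sample text is cut down from the output above; the function name and regular expression are my own, and it assumes plain integer counters (larger counts that zpool abbreviates, such as values with a K suffix, are not handled).

```python
import re

# Sample copied from the zpool status output above (two of the six members).
STATUS = """\
        NAME                                            STATE     READ WRITE CKSUM
        alderaan                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/71b2b4ca-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     0     0
            gptid/72107f31-7e90-11e5-b81e-0cc47a003bd9  ONLINE       0     1     0
"""

# name, state, then three integer counters; the header row does not match
# because its last three fields are not digits.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for every row with a nonzero counter."""
    flagged = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header or non-table line
        name, _state, read, write, cksum = m.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged.append((name, *counts))
    return flagged


flagged = devices_with_errors(STATUS)
for name, read, write, cksum in flagged:
    print(f"{name}: READ={read} WRITE={write} CKSUM={cksum}")
```

On the sample above this flags only the second gptid entry, with a single write error.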

I have no idea.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If you type "dmesg" or inspect the system log, does it show any disk errors?
 