FreeNAS errors/crashes/hangs after a hard drive went bad, zpool scrub finishes in 1 second.

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
A few days ago I got an email alert from FreeNAS stating that one drive's SMART test results could not be read. I stupidly just restarted the server hoping it would clear (which it did, temporarily). That night I was watching Plex before I fell asleep and everything was working fine. The next morning I tried to access Plex but the server was down. I had received another email alert stating, "pool I/O is currently suspended." The web UI was inaccessible, as was SSH. I hooked a monitor and keyboard to the server and tried to shut down from there. It seemed to hang while stopping the cron and zfsd processes, then I got a 90-second watchdog timeout error and a message saying "ps axl" was advised. At that point I hard powered off the server, unplugged the culprit hdd and turned the server back on.

The pool came back online seemingly normal, aside from an alert that it had experienced some unrecoverable errors. I could access the files I tried and was able to connect to my VPN. I ran a zpool scrub and, although I have approximately 20TB filled, the scrub finished in 1 second and listed 489 errors. They were all from jails: 98% of them were from my Plex jail, and a couple of other jails had 1 or 2 each. I deleted my Plex jail entirely (I was meaning to upgrade it to an 11.2 jail anyway) along with another jail, and deleted the offending log file from my VPN jail. The scrub now reports 0 errors in the 1 second it takes to run.

Anyway, since then FreeNAS becomes inaccessible after a little while. Again, I can't access the UI and have to shut down the server as root from the console.

Please, help, lol. Where do I start? Can I reinstall FreeNAS? The short scrub issue may not be tied to this, I had made a post a few weeks ago that I thought scrubs were finishing too fast...but I never manually tried to do one to check the time they finished in.

Thank you

SUPERMICRO MBD-X9SCL-F-O Intel Xeon E3 Server
Intel Xeon E3-1230 V2
Kingston Technology ValueRAM 32GB DDR3 1600MHz PC3 12800 ECC
WD RED (8x) 10TB
IBM ServeRAID M1015 (crossflashed into IT Mode - FW P20) (All eight HDs are plugged into it)

FreeNAS-11.2-U2.1
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Please, help, lol. Where do I start? Can I reinstall FreeNAS? The short scrub issue may not be tied to this, I had made a post a few weeks ago that I thought scrubs were finishing too fast
A scrub should take hours to complete. Even on my home NAS that only has around 12 TB of data, a scrub takes around 4 hours to finish.
At that point I hard powered off the server, unplugged the culprit hdd and turned the server back on.
How did you decide what drive was causing the problem?

What kind of boot media are you using? Normally, when the GUI is affected, it is the boot pool that is the problem, not the data pool.
 

EvanVanVan

A scrub should take hours to complete. Even on my home NAS that only has around 12 TB of data, a scrub takes around 4 hours to finish.
Yeah, I mentioned this in my previous thread, but scrubs started finishing quickly, rather than taking hours or days, after I replaced all my 3TB hdds with 10TB drives back in November (replacing one at a time and resilvering each). I used to get an email that the scrub had started one day and wouldn't get the scrub-finished email until a day or two later. Since I enlarged the pool I have been getting both emails the same day.

How did you decide what drive was causing the problem?
After the initial email stating the SMART results couldn't be retrieved from a drive, I checked the disks in the Storage section of the GUI and found da5 was faulted. I then manually checked the SMART results of da5 over SSH and got the serial number.
What kind of boot media are you using? Normally, when the GUI is affected, it is the boot pool that is the problem, not the data pool.
It's a 120GB Samsung SSD. I got sick of burning through flash drives once a year and just bought an SSD hoping it would be more reliable...but that is an interesting point and makes sense that UI errors are usually boot drive related.

That drive has been scrubbing fine too, but I didn't check how long the scrubs take to complete (I always got the start/finished emails the same day since there's so little data on it).
 

Chris Moore

IBM ServeRAID M1015 (crossflashed into IT Mode - FW P20) (All eight HDs are plugged into it)
From the terminal (SSH) I would like to see the output of sas2flash -list because I have a concern about the drive controller. That should look similar to this:
Code:
root@Emily-NAS:/tmp # sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2308_2(D1)

        Controller Number              : 0
        Controller                     : SAS2308_2(D1)
        PCI Address                    : 00:03:00:00
        SAS Address                    : 500605b-0-09ef-7220
        NVDATA Version (Default)       : 14.01.00.06
        NVDATA Version (Persistent)    : 14.01.00.06
        Firmware Product ID            : 0x2214 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9207-8i
        BIOS Version                   : N/A
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9207-8i
        Board Assembly                 : H3-25412-00J
        Board Tracer Number            : SV45308383

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
 

EvanVanVan

From the terminal (SSH) I would like to see the output of sas2flash -list because I have a concern about the drive controller. That should look similar to this:

Code:
 # sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 0
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 500605b-0-0366-e570
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : N/A
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9211-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

I think it looks good?
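For anyone comparing two of these listings by hand, here is a minimal Python sketch that turns the "Key : Value" lines of a pasted `sas2flash -list` output into a dictionary. The function name `parse_sas2flash_list` is made up for illustration, and the sample is a subset of the listing above. The two fields of interest in this thread are the "(IT)" marker on the Firmware Product ID and the Firmware Version, which is 20.00.07.00 in both listings.

```python
def parse_sas2flash_list(text):
    """Turn the 'Key : Value' lines of `sas2flash -list` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first " : " only; values such as PCI addresses contain colons.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Subset of the listing posted above.
sample = """\
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:01:00:00
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        Board Name                     : SAS9211-8i
"""

info = parse_sas2flash_list(sample)
is_it_mode = "(IT)" in info["Firmware Product ID"]
print(info["Firmware Version"], is_it_mode)
```

Lines without a " : " separator, such as the banner and the "Adapter Selected" line, are simply skipped.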

On a whim (which I realize is risky when you don't know 100% what you're doing with FreeNAS), I've disconnected the pool from FreeNAS. I figure I'll let this sit for a while and see if the GUI freezes up again. At a minimum I'm hoping it protects/isolates my data if I have to hard restart the server again.
 

Chris Moore


EvanVanVan

No issues with the GUI overnight. I reimported the pool and that fixed the scrub issue at least: it's now taking approximately 6 hours to scrub the 23.8TB.
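A quick arithmetic check makes the difference between the two scrub times concrete. This is only a sketch: the function name is made up, and it assumes decimal units (1 TB = 10^12 bytes), which may not match how zpool reports sizes.

```python
def implied_throughput_mb_s(data_tb, seconds):
    """Average read rate in MB/s needed to scan data_tb terabytes in the given time.

    Assumes decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
    """
    return data_tb * 1e12 / seconds / 1e6


six_hour_scrub = implied_throughput_mb_s(23.8, 6 * 3600)
one_second_scrub = implied_throughput_mb_s(20, 1)

print(f"6 h scrub of 23.8 TB: {six_hour_scrub:,.0f} MB/s")
print(f"1 s scrub of 20 TB:   {one_second_scrub:,.0f} MB/s")
```

The 6-hour scrub works out to roughly 1.1 GB/s across the whole pool, which is believable when spread over seven or eight spinning disks. A 1-second scrub of 20TB would need about 20 million MB/s, so that scrub cannot have actually read the data.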

The pool is still in a degraded state since I'm waiting for the Advanced RMA hdd. If it does hang again, where would I find the logs I need to see what happened?

Thanks
 

Chris Moore

When you exported the pool, the system will have moved the system dataset automatically as a part of that process. It may be that the system dataset is on the boot pool now, and that is fine, because you are using an SSD for the boot pool.

Generally speaking, the log files should be stored under the directory /var/log (for example /var/log/messages), but I am not sure if the information you need will be logged.
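If the system log does contain something useful, it will probably be buried among routine entries. Below is a minimal Python sketch of a keyword filter for log lines. The keyword list is a guess on my part, and the sample lines are made up purely to exercise the function; they are not real FreeNAS log output.

```python
def filter_log_lines(lines, keywords):
    """Return the lines that contain any keyword, compared case-insensitively."""
    lowered = [k.lower() for k in keywords]
    return [line for line in lines if any(k in line.lower() for k in lowered)]


# Hypothetical keywords; adjust to whatever your controller and disks actually log.
KEYWORDS = ["mps0", "da5", "timeout", "zfs"]

# Made-up sample lines standing in for /var/log/messages content.
sample = [
    "Mar 19 03:10:01 freenas example: routine message",
    "Mar 19 03:12:44 freenas example: da5 reported a TIMEOUT",
    "Mar 19 03:12:45 freenas example: ZFS noticed a device problem",
]

for line in filter_log_lines(sample, KEYWORDS):
    print(line)
```

To run it against the real file, open it and pass the file object in place of `sample`, for example `with open("/var/log/messages", errors="replace") as f: hits = filter_log_lines(f, KEYWORDS)`.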
 

EvanVanVan

Ughh... So I got an email alert "The volume volume1 state is DEGRADED: One or more devices are faulted in response to IO failures."

The scrub appeared to be stuck @ 33.76%, with 278 errors. But the scary part is that zpool status showed 3 drives as REMOVED, with a 4th drive (the one I unplugged) as unavailable. Since the drives are physically connected I ran zpool clear, which brought all the drives back online, but now it's resilvering for some reason? Actually it just finished; the "resilvering" only took 3 minutes??

I'm wondering if I should stop everything until I at least get a replacement drive back in there, so that I'm not messing with stuff while already down a drive... But I also don't know how resilvering is going to work if another drive is faulted.

zpool status -v is no longer showing the errors it had initially, which were:
Code:
errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x136>
        <metadata>:<0x179>
        <metadata>:<0x17d>
        <metadata>:<0x17f>
        <metadata>:<0x180>
        <metadata>:<0x182>
        <metadata>:<0x1df>
        <metadata>:<0x2f3>
        <metadata>:<0x2f5>
        <metadata>:<0x2f6>
        <metadata>:<0x2f9>
        <metadata>:<0xff>
        /mnt/volume1/Media/Movies/XXXXX/XXXXXX.mkv
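
Entries of the form `<metadata>:<0x...>` refer to pool metadata objects rather than named files, so it can help to count them separately from real file paths. A minimal Python sketch follows; the function name is made up, the sample is shortened from the listing above, and it deliberately ignores other entry forms such as `dataset:<0x...>`.

```python
def split_permanent_errors(status_text):
    """Separate `zpool status -v` error entries into metadata objects and file paths."""
    metadata, files = [], []
    for line in status_text.splitlines():
        entry = line.strip()
        if entry.startswith("<metadata>:"):
            metadata.append(entry)
        elif entry.startswith("/"):
            files.append(entry)
    return metadata, files


# Shortened copy of the listing above (the full list has 14 metadata entries).
sample = """\
errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0xff>
        /mnt/volume1/Media/Movies/XXXXX/XXXXXX.mkv
"""

metadata, files = split_permanent_errors(sample)
print(len(metadata), "metadata objects,", len(files), "file(s)")
```

Run against the full listing, this would report 14 metadata objects and 1 file.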
 

Chris Moore

Ughh... So I got an email alert "The volume volume1 state is DEGRADED: One or more devices are faulted in response to IO failures."

What kind of airflow have you got going over that SAS controller? I have seen some real bad behavior out of those when they overheat.
 

EvanVanVan

What kind of airflow have you got going over that SAS controller? I have seen some real bad behavior out of those when they overheat.
It should be OK. It's a smallish box for 8 drives, but there are 3 fans going and the cover has been off the side since this started.

I tried to scrub again, and again it got stuck with I/O errors and then I got the following two zpool status results:

Note: The UNAVAIL drive was the one I thought was bad but have since plugged back in. The REMOVED drives are still connected; they just showed up as removed during the scrub for some reason.

1st:
Code:
        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         UNAVAIL  1.75K    72     0
          raidz2-0                                      UNAVAIL  3.49K     6     0
            gptid/676cd1f2-f4e6-11e8-8bf8-002590d65107  ONLINE       0     0     0
            gptid/13ed5036-f35c-11e8-b6dc-002590d65107  ONLINE       0     0     0
            gptid/d6006dcf-f1b2-11e8-9ccb-002590d65107  FAULTED    229   378     0  too many errors
            12232178289007768960                        REMOVED      0     0     0  was /dev/gptid/818a27df-f135-11e8-ab2a-002590d65107
            2642898098408983189                         REMOVED      0     0     0  was /dev/gptid/0d8dfe0e-f290-11e8-a1a0-002590d65107
            gptid/55d6e72d-f3d3-11e8-83fe-002590d65107  ONLINE       0     0     0
            5426698041747887479                         UNAVAIL      0     0     0  was /dev/gptid/03a0acac-f428-11e8-afa9-002590d65107
            gptid/3e598cb8-f499-11e8-8050-002590d65107  ONLINE       0     0     0


And second, after it attempted another small (11.0MB) resilver following a zpool clear of the stuck scrub above:
Code:
        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         UNAVAIL  1.75K    72     0
          raidz2-0                                      UNAVAIL  3.49K     6     0
            gptid/676cd1f2-f4e6-11e8-8bf8-002590d65107  ONLINE       0     0     0
            gptid/13ed5036-f35c-11e8-b6dc-002590d65107  ONLINE       0     0     0
            14380211785926036504                        REMOVED      0     0     0  was /dev/gptid/d6006dcf-f1b2-11e8-9ccb-002590d65107
            gptid/818a27df-f135-11e8-ab2a-002590d65107  FAULTED      0   161     0  too many errors
            gptid/0d8dfe0e-f290-11e8-a1a0-002590d65107  ONLINE       0     0     0
            gptid/55d6e72d-f3d3-11e8-83fe-002590d65107  ONLINE       0     0     0
            5426698041747887479                         UNAVAIL      0     0     0  was /dev/gptid/03a0acac-f428-11e8-afa9-002590d65107
            gptid/3e598cb8-f499-11e8-8050-002590d65107  ONLINE       0     0     0
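
To make the two listings easier to compare, here is a minimal Python sketch that pulls out every row whose STATE column is not ONLINE. The function name is made up, and the sample is the first listing above. The point it illustrates: a raidz2 vdev can lose two members, so with four members not ONLINE the vdev and the pool go UNAVAIL.

```python
def unhealthy_devices(status_table):
    """Return (name, state) for every row whose STATE column is not ONLINE."""
    bad = []
    for line in status_table.splitlines():
        parts = line.split()
        # Device rows have at least NAME STATE READ WRITE CKSUM; skip the header.
        if len(parts) < 5 or parts[0] == "NAME":
            continue
        name, state = parts[0], parts[1]
        if state != "ONLINE":
            bad.append((name, state))
    return bad


# First zpool status listing from above.
sample = """\
        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         UNAVAIL  1.75K    72     0
          raidz2-0                                      UNAVAIL  3.49K     6     0
            gptid/676cd1f2-f4e6-11e8-8bf8-002590d65107  ONLINE       0     0     0
            gptid/13ed5036-f35c-11e8-b6dc-002590d65107  ONLINE       0     0     0
            gptid/d6006dcf-f1b2-11e8-9ccb-002590d65107  FAULTED    229   378     0  too many errors
            12232178289007768960                        REMOVED      0     0     0  was /dev/gptid/818a27df-f135-11e8-ab2a-002590d65107
            2642898098408983189                         REMOVED      0     0     0  was /dev/gptid/0d8dfe0e-f290-11e8-a1a0-002590d65107
            gptid/55d6e72d-f3d3-11e8-83fe-002590d65107  ONLINE       0     0     0
            5426698041747887479                         UNAVAIL      0     0     0  was /dev/gptid/03a0acac-f428-11e8-afa9-002590d65107
            gptid/3e598cb8-f499-11e8-8050-002590d65107  ONLINE       0     0     0
"""

# Member disks show up either as gptid/... or as a bare numeric GUID.
leaves = [(n, s) for n, s in unhealthy_devices(sample)
          if n.startswith("gptid/") or n.isdigit()]
print(len(leaves), "member disks not ONLINE; raidz2 tolerates 2")
```

Note that the disks affected differ between the two listings, which is worth keeping in mind when deciding whether the disks themselves are at fault.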


Here are the SMART results for those two FAULTED drives. (I have started new long tests on them now.)

Code:
root@freenas:~ # smartctl -x /dev/da0
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    XXXXXXXX
LU WWN Device Id: 5 000cca 273e05f2b
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 19 16:47:20 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1164) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  --S---   129   129   054    -    112
  3 Spin_Up_Time            POS---   253   253   024    -    215 (Average 130)
  4 Start_Stop_Count        -O--C-   100   100   000    -    54
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         -O-R--   100   100   067    -    0
  8 Seek_Time_Performance   --S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   100   100   000    -    2715
 10 Spin_Retry_Count        -O--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    54
 22 Unknown_Attribute       PO---K   100   100   025    -    100
192 Power-Off_Retract_Count -O--CK   100   100   000    -    460
193 Load_Cycle_Count        -O--C-   100   100   000    -    460
194 Temperature_Celsius     -O----   191   191   000    -    34 (Min/Max 19/37)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   5501  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2715         -
# 2  Short offline       Completed without error       00%      2702         -
# 3  Short offline       Completed without error       00%      2542         -
# 4  Extended offline    Completed without error       00%      2466         -
# 5  Short offline       Completed without error       00%      2375         -
# 6  Short offline       Completed without error       00%      2207         -
# 7  Extended offline    Completed without error       00%      2130         -
# 8  Short offline       Completed without error       00%      2039         -
# 9  Short offline       Completed without error       00%      1871         -
#10  Extended offline    Completed without error       00%      1794         -
#11  Short offline       Completed without error       00%      1703         -
#12  Short offline       Completed without error       00%      1463         -
#13  Extended offline    Completed without error       00%      1386         -
#14  Short offline       Completed without error       00%      1295         -
#15  Short offline       Completed without error       00%      1127         -
#16  Extended offline    Completed without error       00%      1050         -
#17  Short offline       Completed without error       00%       959         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   0
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     34/34 Celsius
Lifetime    Min/Max Temperature:     19/37 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (27)

Index    Estimated Time   Temperature Celsius
  28    2019-03-19 14:40    34  ***************
...    ..(  7 skipped).    ..  ***************
  36    2019-03-19 14:48    34  ***************
  37    2019-03-19 14:49    35  ****************
...    ..(  5 skipped).    ..  ****************
  43    2019-03-19 14:55    35  ****************
  44    2019-03-19 14:56    34  ***************
...    ..(  5 skipped).    ..  ***************
  50    2019-03-19 15:02    34  ***************
  51    2019-03-19 15:03    35  ****************
...    ..(  7 skipped).    ..  ****************
  59    2019-03-19 15:11    35  ****************
  60    2019-03-19 15:12    34  ***************
  61    2019-03-19 15:13    34  ***************
  62    2019-03-19 15:14    34  ***************
  63    2019-03-19 15:15    33  **************
  64    2019-03-19 15:16    33  **************
  65    2019-03-19 15:17    34  ***************
...    ..( 20 skipped).    ..  ***************
  86    2019-03-19 15:38    34  ***************
  87    2019-03-19 15:39    35  ****************
...    ..(  5 skipped).    ..  ****************
  93    2019-03-19 15:45    35  ****************
  94    2019-03-19 15:46    34  ***************
...    ..(  3 skipped).    ..  ***************
  98    2019-03-19 15:50    34  ***************
  99    2019-03-19 15:51    35  ****************
...    ..(  9 skipped).    ..  ****************
109    2019-03-19 16:01    35  ****************
110    2019-03-19 16:02    34  ***************
...    ..(  3 skipped).    ..  ***************
114    2019-03-19 16:06    34  ***************
115    2019-03-19 16:07    35  ****************
116    2019-03-19 16:08    34  ***************
...    ..(  2 skipped).    ..  ***************
119    2019-03-19 16:11    34  ***************
120    2019-03-19 16:12    35  ****************
121    2019-03-19 16:13    34  ***************
...    ..( 33 skipped).    ..  ***************
  27    2019-03-19 16:47    34  ***************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) supported [please try: '-l defects']

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
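
These dumps are long, so here is a minimal Python sketch that checks only a handful of attributes in the table. The choice of IDs (5, 197, 198 for sector problems and 199 for link CRC errors) is my assumption about what is usually worth watching, not something smartctl prescribes, and the function name is made up. The sample rows are copied from the da0 output above, where all four raw values are 0.

```python
# Assumed watch list: reallocated, pending, offline-uncorrectable sectors, CRC errors.
WATCHED_IDS = {5, 197, 198, 199}


def nonzero_watched_attributes(attribute_table):
    """Return {attribute_name: raw_value} for watched IDs whose raw value is not 0."""
    flagged = {}
    for line in attribute_table.splitlines():
        parts = line.split()
        # Rows look like: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE...
        if len(parts) < 8 or not parts[0].isdigit():
            continue
        attr_id, name, raw = int(parts[0]), parts[1], parts[7]
        if attr_id in WATCHED_IDS and raw != "0":
            flagged[name] = raw
    return flagged


# Rows copied from the da0 attribute table above.
sample = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  9 Power_On_Hours          -O--C-   100   100   000    -    2715
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
"""

print(nonzero_watched_attributes(sample))
```

An empty result for a drive that ZFS marked FAULTED is one hint, though not proof, that the problem may sit somewhere other than the disk itself.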

Code:
root@freenas:~ # smartctl -x /dev/da1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    XXXXXXXXX
LU WWN Device Id: 5 000cca 267c824aa
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 19 16:51:49 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1137) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  --S---   130   130   054    -    108
  3 Spin_Up_Time            POS---   253   253   024    -    115 (Average 210)
  4 Start_Stop_Count        -O--C-   100   100   000    -    57
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         -O-R--   100   100   067    -    0
  8 Seek_Time_Performance   --S---   128   128   020    -    18
  9 Power_On_Hours          -O--C-   100   100   000    -    2700
 10 Spin_Retry_Count        -O--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    57
 22 Unknown_Attribute       PO---K   100   100   025    -    100
192 Power-Off_Retract_Count -O--CK   100   100   000    -    573
193 Load_Cycle_Count        -O--C-   100   100   000    -    573
194 Temperature_Celsius     -O----   191   191   000    -    34 (Min/Max 19/39)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   5501  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    256  Current Device Internal Status Data log
0x25       GPL     R/O    256  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2687         -
# 2  Short offline       Completed without error       00%      2527         -
# 3  Extended offline    Completed without error       00%      2451         -
# 4  Short offline       Completed without error       00%      2360         -
# 5  Short offline       Completed without error       00%      2192         -
# 6  Extended offline    Completed without error       00%      2115         -
# 7  Short offline       Completed without error       00%      2024         -
# 8  Short offline       Completed without error       00%      1856         -
# 9  Extended offline    Completed without error       00%      1780         -
#10  Short offline       Completed without error       00%      1688         -
#11  Short offline       Completed without error       00%      1448         -
#12  Extended offline    Completed without error       00%      1372         -
#13  Short offline       Completed without error       00%      1280         -
#14  Short offline       Completed without error       00%      1112         -
#15  Extended offline    Completed without error       00%      1035         -
#16  Short offline       Completed without error       00%       944         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   0
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     34/34 Celsius
Lifetime    Min/Max Temperature:     19/39 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -40/70 Celsius
Temperature History Size (Index):    128 (100)

Index    Estimated Time   Temperature Celsius
101    2019-03-19 14:44    35  ****************
102    2019-03-19 14:45    34  ***************
103    2019-03-19 14:46    34  ***************
104    2019-03-19 14:47    34  ***************
105    2019-03-19 14:48    35  ****************
...    ..( 12 skipped).    ..  ****************
118    2019-03-19 15:01    35  ****************
119    2019-03-19 15:02    34  ***************
120    2019-03-19 15:03    35  ****************
...    ..(  5 skipped).    ..  ****************
126    2019-03-19 15:09    35  ****************
127    2019-03-19 15:10    34  ***************
...    ..( 13 skipped).    ..  ***************
  13    2019-03-19 15:24    34  ***************
  14    2019-03-19 15:25    35  ****************
...    ..(  2 skipped).    ..  ****************
  17    2019-03-19 15:28    35  ****************
  18    2019-03-19 15:29    34  ***************
...    ..(  3 skipped).    ..  ***************
  22    2019-03-19 15:33    34  ***************
  23    2019-03-19 15:34    35  ****************
...    ..( 11 skipped).    ..  ****************
  35    2019-03-19 15:46    35  ****************
  36    2019-03-19 15:47    34  ***************
  37    2019-03-19 15:48    35  ****************
...    ..( 13 skipped).    ..  ****************
  51    2019-03-19 16:02    35  ****************
  52    2019-03-19 16:03    34  ***************
  53    2019-03-19 16:04    34  ***************
  54    2019-03-19 16:05    35  ****************
...    ..(  3 skipped).    ..  ****************
  58    2019-03-19 16:09    35  ****************
  59    2019-03-19 16:10    34  ***************
...    ..( 39 skipped).    ..  ***************
  99    2019-03-19 16:50    34  ***************
100    2019-03-19 16:51    35  ****************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) supported [please try: '-l defects']

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The SAS controller is a type that needs high airflow. Having the case open is bad for airflow in most cases. I have seen these controllers fail permanently due to overheating.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Some forum members have needed to add fans to keep their SAS chips cool enough:

[Attached image: 1553090785144.png]

Also: https://www.reddit.com/r/DataHoarde...igned_and_3d_printed_a_fan_mount_for_my_dell/
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
Some forum members have needed to add fans to keep their SAS chips cool enough:

View attachment 29413

Also: https://www.reddit.com/r/DataHoarde...igned_and_3d_printed_a_fan_mount_for_my_dell/
Wow, cool. I might have to do something like that, because heat may be the issue after all... All eight of the drives passed the long SMART tests, and the server has been running healthy for 24 straight hours. Nothing has changed about the server's location, though, and it has been fine there for many years, including the past few winter (heating) months. During the spring/summer I've gotten the rare HDD temperature warning (40 degrees), but from what I've seen recently the drives have been running consistently at around 30 degrees.

I'm going to try another scrub now.
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
It successfully completed a scrub. But what about these cksum errors?
Code:
 pool: volume1
state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 560K in 0 days 05:45:47 with 0 errors on Wed Mar 20 22:38:48 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/676cd1f2-f4e6-11e8-8bf8-002590d65107  ONLINE       0     0     0
            gptid/13ed5036-f35c-11e8-b6dc-002590d65107  ONLINE       0     0     0
            gptid/d6006dcf-f1b2-11e8-9ccb-002590d65107  DEGRADED     0     0    73  too many errors
            gptid/818a27df-f135-11e8-ab2a-002590d65107  DEGRADED     0     0    75  too many errors
            gptid/0d8dfe0e-f290-11e8-a1a0-002590d65107  ONLINE       0     0     0
            gptid/55d6e72d-f3d3-11e8-83fe-002590d65107  ONLINE       0     0     0
            gptid/03a0acac-f428-11e8-afa9-002590d65107  ONLINE       0     0     1
            gptid/3e598cb8-f499-11e8-8050-002590d65107  ONLINE       0     0     0

errors: No known data errors

The page referenced in the "see:" line, http://illumos.org/msg/ZFS-8000-9P, says:
Find the device with a non-zero error count for READ, WRITE, or CKSUM. This indicates that the device has experienced a read I/O error, write I/O error, or checksum validation error. Because the device is part of a mirror or RAID-Z device, ZFS was able to recover from the error and subsequently repair the damaged data.
That seems to say that ZFS should have repaired the damaged data, but I don't like that "too many errors" note. Am I OK to run zpool clear? (I'm only on RAIDZ2, by the way, and I'm backing up my important data.)
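The "find the device with a non-zero error count" step can be done by eye here, but for reference, here is a minimal Python sketch of it. It is only an illustration: the function name `devices_with_errors` and the regular expression are my own assumptions, not part of ZFS or FreeNAS, and the sample text is a trimmed copy of the status output above. It only handles plain integer counters, so it would need adjusting if a ZFS version abbreviates large counts.

```python
import re

# Trimmed copy of the config table from the `zpool status` output above.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        volume1                                         DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/676cd1f2-f4e6-11e8-8bf8-002590d65107  ONLINE       0     0     0
            gptid/d6006dcf-f1b2-11e8-9ccb-002590d65107  DEGRADED     0     0    73  too many errors
            gptid/818a27df-f135-11e8-ab2a-002590d65107  DEGRADED     0     0    75  too many errors
            gptid/03a0acac-f428-11e8-afa9-002590d65107  ONLINE       0     0     1
"""

# Hypothetical pattern: name, state, three integer counters, optional trailing note.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*))?$")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header row or any line that is not part of the table
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            flagged.append((name, state, read, write, cksum))
    return flagged


if __name__ == "__main__":
    for name, state, read, write, cksum in devices_with_errors(SAMPLE):
        print(f"{name}  {state}  READ={read} WRITE={write} CKSUM={cksum}")
```

Run against the sample, this flags the two DEGRADED members plus the ONLINE member that has a single checksum error, which is easy to overlook in the full table.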

edit: here are the smart test results again:
Code:
# smartctl -a /dev/da1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    XXXXXXXX
LU WWN Device Id: 5 000cca 267c824aa
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 21 00:30:30 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1137) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   130   130   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       115 (Average 210)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2732
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       574
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       574
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 19/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2720         -
# 2  Short offline       Completed without error       00%      2701         -
# 3  Short offline       Completed without error       00%      2687         -
# 4  Short offline       Completed without error       00%      2527         -
# 5  Extended offline    Completed without error       00%      2451         -
# 6  Short offline       Completed without error       00%      2360         -
# 7  Short offline       Completed without error       00%      2192         -
# 8  Extended offline    Completed without error       00%      2115         -
# 9  Short offline       Completed without error       00%      2024         -
#10  Short offline       Completed without error       00%      1856         -
#11  Extended offline    Completed without error       00%      1780         -
#12  Short offline       Completed without error       00%      1688         -
#13  Short offline       Completed without error       00%      1448         -
#14  Extended offline    Completed without error       00%      1372         -
#15  Short offline       Completed without error       00%      1280         -
#16  Short offline       Completed without error       00%      1112         -
#17  Extended offline    Completed without error       00%      1035         -
#18  Short offline       Completed without error       00%       944         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Code:
# smartctl -a /dev/da0
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    XXXXXXXX
LU WWN Device Id: 5 000cca 273e05f2b
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 21 00:32:50 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1164) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   129   129   054    Old_age   Offline      -       112
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       215 (Average 130)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       54
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2747
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       461
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       461
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 19/37)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2734         -
# 2  Short offline       Completed without error       00%      2715         -
# 3  Short offline       Completed without error       00%      2702         -
# 4  Short offline       Completed without error       00%      2542         -
# 5  Extended offline    Completed without error       00%      2466         -
# 6  Short offline       Completed without error       00%      2375         -
# 7  Short offline       Completed without error       00%      2207         -
# 8  Extended offline    Completed without error       00%      2130         -
# 9  Short offline       Completed without error       00%      2039         -
#10  Short offline       Completed without error       00%      1871         -
#11  Extended offline    Completed without error       00%      1794         -
#12  Short offline       Completed without error       00%      1703         -
#13  Short offline       Completed without error       00%      1463         -
#14  Extended offline    Completed without error       00%      1386         -
#15  Short offline       Completed without error       00%      1295         -
#16  Short offline       Completed without error       00%      1127         -
#17  Extended offline    Completed without error       00%      1050         -
#18  Short offline       Completed without error       00%       959         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Code:
# smartctl -a /dev/da5
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    XXXXXXXXX
LU WWN Device Id: 5 000cca 273da3d00
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 21 00:34:27 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1173) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   129   129   054    Old_age   Offline      -       112
  3 Spin_Up_Time            0x0007   176   176   024    Pre-fail  Always       -       386 (Average 360)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2560
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       114
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       114
194 Temperature_Celsius     0x0002   203   203   000    Old_age   Always       -       32 (Min/Max 21/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2548         -
# 2  Short offline       Completed without error       00%      2530         -
# 3  Extended offline    Completed without error       00%      2380         -
# 4  Short offline       Completed without error       00%      2289         -
# 5  Short offline       Completed without error       00%      2121         -
# 6  Extended offline    Completed without error       00%      2045         -
# 7  Short offline       Completed without error       00%      1953         -
# 8  Short offline       Completed without error       00%      1785         -
# 9  Extended offline    Completed without error       00%      1709         -
#10  Short offline       Completed without error       00%      1617         -
#11  Short offline       Completed without error       00%      1377         -
#12  Extended offline    Completed without error       00%      1301         -
#13  Short offline       Completed without error       00%      1209         -
#14  Short offline       Completed without error       00%      1041         -
#15  Extended offline    Completed without error       00%       965         -
#16  Short offline       Completed without error       00%       873         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
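
Since the three dumps above are long, here is a rough Python sketch that pulls out just the raw values of a few attributes that people often check first. The shortlist (IDs 5, 197, 198 and 199) is my own assumption about what is commonly watched, not an official list, and the helper name `watched_raw_values` is made up for this example. It also assumes the raw value of each watched attribute is a plain integer, as it is in the output above.

```python
# Attribute rows copied from the /dev/da1 output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 19/39)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

# Assumption: a commonly used shortlist, not something smartctl itself defines.
WATCHED_IDS = {5, 197, 198, 199}


def watched_raw_values(smart_text):
    """Map attribute name -> raw value for the watched attribute IDs."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS:
            values[fields[1]] = int(fields[9])  # column 10 is RAW_VALUE
    return values


if __name__ == "__main__":
    for name, raw in watched_raw_values(SAMPLE).items():
        print(f"{name}: {raw}")
```

For all three drives posted above, these four raw values are 0.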

edit2: I should note that I noticed my PSU fan starts and stops repeatedly while the system is on. I wonder if something is failing there...
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
After backing up my most important data I've been procrastinating on this a little bit. I know I'm playing with fire with 2 degraded drives, but if I lose my movie collection I'll survive.

Is it better not to run another scrub until I figure out what's causing these checksum errors? I don't know whether it makes sense to replace the LSI card and cables out of an abundance of caution. @Chris Moore, I saw in another thread that you recommended this PCI-E 3.0 card, which runs cooler: https://www.ebay.com/itm/HP-H220-6G...0-IT-Mode-for-ZFS-FreeNAS-unRAID/162862201664

I want to open up the box and see how dusty it is tomorrow...

Thank you for all the help

Edit: I also don't know whether I should 1. replace the drive I RMAed already (the one with 1 checksum error), 2. plead my case to WD that there was a mistake and my original drive was OK, and just see if they'll take back the drive they're sending, or 3. keep the drive as a spare for the $137 they're going to charge my card. #3 might make them less trusting if I try to RMA the other two drives, though.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
3. keep the drive as a spare for the $137 they're going to charge my card.
That would be my vote, but you have to keep in mind that it probably voids the warranty on the drive that it was ordered against.
I want to open up the box and see how dusty it is tomorrow...
How did that go? Dust is a constant battle around here.
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
How did that go? Dust is a constant battle around here.

Dust wasn't too bad, but I blew everything out anyway and re-seated all the cables and the LSI card. I also noticed that, for some unknown reason, I had plugged the LSI card into the "PCI-E 2.0 4X on 8X" slot rather than an open 8X slot when I built this thing 5 years ago. Anyway, I put it in a proper 8X slot.

I ran another scrub and everything came out clean, with no checksum errors on any of the drives. I've also increased the frequency of my automatic SMART tests and scrubs a little bit.

Thank you again for all the help.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I have seen re-seating connections fix problems before, so I hope this makes it all better. Thanks for posting back the results.
 