Smartctl open device: /dev/da4 failed: INQUIRY failed

MorkaiTheWolf

Dabbler
Joined
Aug 8, 2018
Messages
32
This evening I received an alert on my system stating that my main zpool (labeled Tank) had become degraded.
The specific messages:
Code:
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], not capable of SMART self-check
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], failed to read SMART Attribute Data
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], Read SMART Self-Test Log Failed
CRITICAL: Dec. 23, 2018, 6:22 p.m. - Device: /dev/da4 [SAT], Read SMART Error Log Failed


I am unsure whether I should reboot the system, but I fear this is a sign that one of my disks is failing. For further background, here are some system notes:
Version: FreeNAS-11.1-U6 (caffd76fa)
Motherboard: ASRock Motherboard ATX DDR3 1066 Intel LGA 2011 EP2C602-4L/D16
CPU: 2x Xeon E5-2680 v2 @ 2.80GHz
CPU Cooler: 2x Noctua i4
RAM: 56 GB (two 4GB sticks went bad that I still need to replace)
PSU: EVGA SuperNOVA 850 T2
Case: Phanteks Enthoo Pro
HBA: LSI 9210-8i
Storage:
6x WD Red HE 10TB
2x Samsung 850 EVO 250GB

Zpool status
Code:
  pool: Jails
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:04:14 with 0 errors on Sat Dec  1 05:04:14 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        Jails                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/d7e792f0-aff6-11e8-a085-d05099c3f976  ONLINE       0     0     0
            gptid/d86c6f5e-aff6-11e8-a085-d05099c3f976  ONLINE       0     0     0

errors: No known data errors

  pool: Tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 0 days 13:44:27 with 0 errors on Sun Dec  2 13:44:28 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        Tank                                            DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/1f66829a-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            gptid/20911f23-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            gptid/21b89c69-d719-11e8-a0f2-d05099c3f976  FAULTED      1     1     0  too many errors
            gptid/22d8de81-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            gptid/23f9ec88-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            gptid/251ff784-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:54 with 0 errors on Sun Dec 23 03:46:54 2018
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da8p2     ONLINE       0     0     0


glabel status:
Code:
                                      Name  Status  Components
gptid/d7e792f0-aff6-11e8-a085-d05099c3f976     N/A  da0p2
gptid/d86c6f5e-aff6-11e8-a085-d05099c3f976     N/A  da1p2
gptid/1f66829a-d719-11e8-a0f2-d05099c3f976     N/A  da2p2
gptid/20911f23-d719-11e8-a0f2-d05099c3f976     N/A  da3p2
gptid/21b89c69-d719-11e8-a0f2-d05099c3f976     N/A  da4p2
gptid/22d8de81-d719-11e8-a0f2-d05099c3f976     N/A  da5p2
gptid/23f9ec88-d719-11e8-a0f2-d05099c3f976     N/A  da6p2
gptid/251ff784-d719-11e8-a0f2-d05099c3f976     N/A  da7p2
gptid/27bbfa76-afe1-11e8-8edb-d05099c3f976     N/A  da8p1
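
For anyone following along, the lookup from the FAULTED gptid in `zpool status` to a device node can be sketched in Python. This is only a rough illustration: `parse_glabel` is a hypothetical helper name, and the sample text is an excerpt of the `glabel status` output pasted above rather than a live command call.

```python
# Hypothetical helper: map a gptid label from `zpool status` to its device
# node using the text of `glabel status`. Sample rows are copied from the post.

GLABEL_OUTPUT = """\
                                      Name  Status  Components
gptid/1f66829a-d719-11e8-a0f2-d05099c3f976     N/A  da2p2
gptid/20911f23-d719-11e8-a0f2-d05099c3f976     N/A  da3p2
gptid/21b89c69-d719-11e8-a0f2-d05099c3f976     N/A  da4p2
"""


def parse_glabel(text):
    """Return {label: component} for each data row of `glabel status` output."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns: label, status, component.
        # The header row also has three columns, so skip it by name.
        if len(fields) == 3 and fields[0] != "Name":
            mapping[fields[0]] = fields[2]
    return mapping


labels = parse_glabel(GLABEL_OUTPUT)
faulted = "gptid/21b89c69-d719-11e8-a0f2-d05099c3f976"
print(labels[faulted])  # da4p2
```

The mapping only holds for the moment the output was captured, since daN names can be reassigned when a disk disappears or the system reboots.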


With this in mind, I figured I would see what the SMART report says, and that is where the weirdness begins. This is all the output I receive:
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/da4 failed: INQUIRY failed


This makes me think the drive might be dead or dying, since I can't even read it. I do have a SMART status report that sends me updates periodically, and here are the results from 7 a.m. this morning:
Code:
########## SMART status report for da4 drive (: 7JGGTLZC) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   130   130   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   149   149   024    Pre-fail  Always       -       439 (Average 442)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1900
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       2422
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       2422
194 Temperature_Celsius     0x0002   250   250   000    Old_age   Always       -       26 (Min/Max 20/36)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%       246         -


I do feel like I am missing something about what the cause could be, though. Would it be worth rebooting the system to see if I can run smartctl -a against the drive again, or would that cause more issues?

Luckily, I've documented where the drive is in the case, so a swap shouldn't be terribly difficult, but it will take some time to find a 10TB replacement for a decent price.
 
Joined
May 10, 2017
Messages
838
Looks like the drive dropped offline. Reboot or power cycle the server to see if it comes back online and, if it does, grab a SMART report.
 

MorkaiTheWolf

After rebooting, I'm able to run a SMART report on that drive. Here are the results:
Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EZAZ-11TDBA0
Serial Number:    7JK99MKC
LU WWN Device Id: 5 000cca 266ee8201
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 24 02:58:30 2018 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1167) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   129   129   054    Old_age   Offline      -       112
  3 Spin_Up_Time            0x0007   149   149   024    Pre-fail  Always       -       440 (Average 440)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1920
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       2426
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       2426
194 Temperature_Celsius     0x0002   004   004   000    Old_age   Always       -       25 (Min/Max 20/35)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       249         -
# 2  Short offline       Completed without error       00%       118         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Unfortunately, it appears the drive still shows as unavailable in my zpool:
Code:
root@freenas:~ # zpool status Tank
  pool: Tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0 days 13:44:27 with 0 errors on Sun Dec  2 13:44:28 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        Tank                                            DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/1f66829a-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            gptid/20911f23-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            2652733316161051457                         UNAVAIL      0     0     0  was /dev/gptid/21b89c69-d719-11e8-a0f2-d05099c3f976
            gptid/22d8de81-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            gptid/23f9ec88-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
            gptid/251ff784-d719-11e8-a0f2-d05099c3f976  ONLINE       0     0     0
 
Joined
May 10, 2017
Messages
838
SMART looks fine. I would swap or replace the cables and backplane slot to rule them out, then online the disk.
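
As a rough illustration of what "fine" means here, the check below flags non-zero raw values on a few attributes. The choice of IDs 5, 197, 198 and 199 is a common rule of thumb and an assumption on my part, not something smartctl defines; `nonzero_watched` is a hypothetical helper, and the rows are copied from the report above.

```python
# Sketch: flag watched SMART attributes whose raw value is not zero.
# WATCHED is an assumed rule-of-thumb list, not an official one.

SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""

WATCHED = {5, 197, 198, 199}


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with raw != 0."""
    bad = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten
        # columns; the tenth column is the leading part of RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            raw = int(fields[9])
            if attr_id in WATCHED and raw != 0:
                bad[fields[1]] = raw
    return bad


print(nonzero_watched(SMART_ROWS))  # {}
```

An empty result only says these counters are clean. It does not rule out a cabling, power, or controller problem, which is why swapping cables is still worth doing.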
 

MorkaiTheWolf

Upon further review, I don't think that /dev/da4 is the same drive it was before the reboot.

The email gives the serial number 7JGGTLZC for the drive, and in the GUI that drive is no longer listed. In fact, I can't find this drive anywhere on my system (from the command line, anyway).

All 6 of these drives are attached via the same HBA cable.
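
The serial comparison can be sketched like this. It assumes the `Serial Number:` line format shown in the smartctl output earlier in the thread; `serial_number` is a hypothetical helper, and the sample text is pasted from that output rather than read from the device.

```python
import re

# Two lines copied from the post-reboot smartctl report for /dev/da4.
SMARTCTL_INFO = """\
Device Model:     WDC WD100EZAZ-11TDBA0
Serial Number:    7JK99MKC
"""


def serial_number(text):
    """Return the serial from smartctl information output, or None if absent."""
    match = re.search(r"^Serial Number:\s+(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None


expected = "7JGGTLZC"  # serial reported for da4 in the morning email
found = serial_number(SMARTCTL_INFO)
print(found == expected)  # False
```

A mismatch like this suggests the da4 name now points at a different physical disk, so the serial is a safer identifier than the device name.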
 

MorkaiTheWolf

Ran the following to see if the system detected the drive:
Code:
camcontrol devlist
<ATA Samsung SSD 850 2B6Q>         at scbus0 target 2 lun 0 (pass0,da0)
<ATA Samsung SSD 850 2B6Q>         at scbus0 target 3 lun 0 (pass1,da1)
<Marvell Console 1.01>             at scbus8 target 0 lun 0 (pass2)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 6 lun 0 (pass3,da2)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 7 lun 0 (pass4,da3)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 9 lun 0 (pass5,da4)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 10 lun 0 (pass6,da5)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 11 lun 0 (pass7,da6)
<SanDisk Cruzer Fit 1.00>          at scbus17 target 0 lun 0 (pass8,da7)


I believe the affected drive is at scbus15 target 8.

Is there any way to make the drive come online again? I'm mainly curious whether that's the bad drive.
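
The gap in the target numbers can be found with a short sketch like the one below. It assumes the targets on a bus were contiguous to begin with, which is not guaranteed, so treat a reported gap as a hint rather than proof. `missing_targets` is a hypothetical helper, and the sample lines are copied from the `camcontrol devlist` output above.

```python
import re

DEVLIST = """\
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 6 lun 0 (pass3,da2)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 7 lun 0 (pass4,da3)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 9 lun 0 (pass5,da4)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 10 lun 0 (pass6,da5)
<ATA WDC WD100EZAZ-11 0A83>        at scbus15 target 11 lun 0 (pass7,da6)
"""


def missing_targets(text, bus):
    """Return target numbers absent between the lowest and highest seen on a bus."""
    pattern = re.compile(r"at scbus(\d+) target (\d+) lun")
    targets = sorted(
        int(m.group(2)) for m in pattern.finditer(text) if int(m.group(1)) == bus
    )
    if not targets:
        return []
    present = set(targets)
    return [t for t in range(targets[0], targets[-1] + 1) if t not in present]


print(missing_targets(DEVLIST, 15))  # [8]
```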
 
Joined
May 10, 2017
Messages
838
Power down and swap the cables or backplane slot with another drive. If it still doesn't come online, it's likely dead.
 

MorkaiTheWolf

Wanted to circle back on this as I've made some headway.

So, it appears all of my 10TB drives were shucked from external hard drive enclosures. I found this out while following up on the warranty status, as they all come back as WD EasyStore drives rather than WD Reds.

I believe I've stumbled onto what could be causing this. I looked into a few other things about them, and it appears that newer EasyStore drives react to 3.3V on the third pin of the SATA power connector (see more details here: https://youtu.be/9W3-uOl4ruc). This would mean the power disable feature is kicking in and making the drive invisible to the system. I would have expected this to happen when I first installed the drives, but all 6 worked flawlessly. I'm going to open the system in the next few days when I'm in town to see whether this is the case or whether the drive is in fact dead. If it is dead, an RMA won't work, as I was not provided the original case and WD will not honor the warranty without it.

I should say that if that is indeed the case, I'll be getting the appropriate adapters to power all 6 drives via Molex-to-SATA rather than straight SATA power cables from the PSU. Unfortunately, it's not in the budget to replace all 6 with WD Reds directly from WD at this time. :/

Either way, I'll post another update once I get some more information. This might be entirely user error and me not realizing it. Derp.
 