suddenly failed tank on production

blckhm

Dabbler
Joined
Sep 24, 2018
Messages
42
Hi,

I could not figure out what happened on 1 May on our production server.

We have 2 pools on this storage server. One is a large 12x4TB array (tank1) and the other is a single-SSD (480GB) pool named "tank4".

Suddenly, tank4 became unavailable at 11:00 am. By itself that would not be a problem, because everything is stored on tank1, but the entire NFS service crashed due to the tank4 failure. Every application on our application side that tried to connect over NFS threw an error until we reset the server with the power button.

We could not stop/start the NFS service from the GUI.
We could not restart the NFS service from the CLI.

When I checked 'ps aux | grep nfs', process 1886 was stuck in the running state. When I tried to kill 1886, nothing changed. It was nailed there!

Every time, I got an error like this:
Code:
[2021/05/01 15:36:30] (WARNING) middlewared.plugins.service_.services.base_freebsd.freebsd_service():134 - lockd forcestop failed with code 1: 'lockd not running?\n'
[2021/05/01 15:36:31] (WARNING) middlewared.plugins.service_.services.base_freebsd.freebsd_service():134 - statd forcestop failed with code 1: 'statd not running?\n'


When I tried to reboot the system from the CLI, it was still trying to stop that process :)

Code:
root@freenas2[/var/log]# /usr/sbin/service nfsd onestart
nfsd already running?  (pid=1886).
root@freenas2[/var/log]# /usr/sbin/service nfsd restart
Stopping nfsd.
Waiting for PIDS: 1886


So I had to reset it with the power button.

I have collected all the logs from the failure below, so maybe one of our gurus can explain.


Code:
[2021/05/01 11:18:28] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 11:53:30] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 12:18:31] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 12:53:32] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 13:18:33] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 13:53:35] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 14:18:36] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 14:53:37] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 15:43:22] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 15:43:23] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 15:43:23] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 15:43:35] (ERROR) libzfs.<listcomp>():430 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended
[2021/05/01 15:43:37] (ERROR) libzfs.query():494 - Failed to retrieve dataset handle for tank4: pool I/O is currently suspended


Code:
May  1 10:58:40 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 1793 loginfo 31170000
May  1 10:58:40 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 1121 loginfo 31170000
May  1 10:58:40 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 1150 loginfo 31170000
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 ff 30 00 00 68 00
May  1 10:58:40 freenas2 mpr0: (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:40 freenas2 Controller reported scsi ioc terminated tgt 20 SMID 831 loginfo 31170000
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): Retrying command, 3 more tries remain
May  1 10:58:40 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 785 loginfo 31170000
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1e 02 3f 28 00 00 28 00
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): Retrying command, 3 more tries remain
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 1d 77 bf 00 00 00 10 00
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): Retrying command, 3 more tries remain
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 fe e8 00 00 20 00
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): Retrying command, 3 more tries remain
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 37 61 20 78 00 00 08 00
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): Retrying command, 3 more tries remain
May  1 10:58:41 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 458 loginfo 31170000
May  1 10:58:41 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 1645 loginfo 31170000
May  1 10:58:41 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 2152 loginfo 31170000
May  1 10:58:41 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 1973 loginfo 31170000
May  1 10:58:41 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 485 loginfo 31170000
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 37 61 20 78 00 00 08 00
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): Retrying command, 2 more tries remain
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 1d 77 bf 00 00 00 10 00
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): Retrying command, 2 more tries remain
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1e 02 3f 28 00 00 28 00
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): Retrying command, 2 more tries remain
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 fe e8 00 00 20 00
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): Retrying command, 2 more tries remain
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 ff 30 00 00 68 00
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:41 freenas2 (da12:mpr0:0:20:0): Retrying command, 2 more tries remain
May  1 10:58:46 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 2060 loginfo 31110e05
May  1 10:58:46 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 949 loginfo 31110e05
May  1 10:58:46 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 617 loginfo 31110e05
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 ff 30 00 00 68 00
May  1 10:58:46 freenas2 mpr0: (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): Retrying command, 1 more tries remain
May  1 10:58:46 freenas2 Controller reported scsi ioc terminated tgt 20 SMID 1865 loginfo 31110e05
May  1 10:58:46 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 481 loginfo 31110e05
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 fe e8 00 00 20 00
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): Retrying command, 1 more tries remain
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1e 02 3f 28 00 00 28 00
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): Retrying command, 1 more tries remain
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 1d 77 bf 00 00 00 10 00
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): Retrying command, 1 more tries remain
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 37 61 20 78 00 00 08 00
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:46 freenas2 (da12:mpr0:0:20:0): Retrying command, 1 more tries remain
May  1 10:58:47 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 953 loginfo 31170000
May  1 10:58:47 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 1470 loginfo 31170000
May  1 10:58:47 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 632 loginfo 31170000
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 37 61 20 78 00 00 08 00
May  1 10:58:47 freenas2 mpr0: (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): Retrying command, 0 more tries remain
May  1 10:58:47 freenas2 Controller reported scsi ioc terminated tgt 20 SMID 273 loginfo 31170000
May  1 10:58:47 freenas2 mpr0: Controller reported scsi ioc terminated tgt 20 SMID 170 loginfo 31170000
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 1d 77 bf 00 00 00 10 00
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): Retrying command, 0 more tries remain
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1e 02 3f 28 00 00 28 00
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): Retrying command, 0 more tries remain
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 fe e8 00 00 20 00
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): Retrying command, 0 more tries remain
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 ff 30 00 00 68 00
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error
May  1 10:58:47 freenas2 (da12:mpr0:0:20:0): Retrying command, 0 more tries remain
May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): READ(10). CDB: 28 00 37 61 20 78 00 00 08 00
May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): CAM status: SCSI Status Error
May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): SCSI status: Check Condition
May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): Error 6, Retries exhausted
May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): Invalidating pack
May  1 10:58:54 freenas2 Solaris: WARNING: Pool 'tank4' has encountered an uncorrectable I/O failure and has been suspended.
May  1 11:00:04 freenas2 1 2021-05-01T11:00:04.177442+03:00 freenas2.istanbul mountd 1879 - - can't open /etc/zfs/exports
May  1 15:43:38 freenas2 GEOM_ELI: Device mirror/swap4.eli destroyed.
May  1 15:43:39 freenas2 GEOM_MIRROR: Device swap4: provider destroyed.
May  1 15:43:39 freenas2 GEOM_MIRROR: Device swap4 destroyed.
May  1 15:43:39 freenas2 GEOM_ELI: Device mirror/swap3.eli destroyed.
May  1 15:43:40 freenas2 GEOM_MIRROR: Device swap3: provider destroyed.
May  1 15:43:40 freenas2 GEOM_MIRROR: Device swap3 destroyed.
May  1 15:43:40 freenas2 GEOM_ELI: Device mirror/swap2.eli destroyed.
May  1 15:43:40 freenas2 GEOM_MIRROR: Device swap2: provider destroyed.
May  1 15:43:40 freenas2 GEOM_MIRROR: Device swap2 destroyed.
May  1 15:43:41 freenas2 GEOM_ELI: Device mirror/swap1.eli destroyed.
May  1 15:43:41 freenas2 GEOM_MIRROR: Device swap1: provider destroyed.
May  1 15:43:41 freenas2 GEOM_MIRROR: Device swap1 destroyed.
May  1 15:43:41 freenas2 GEOM_ELI: Device mirror/swap0.eli destroyed.
May  1 15:43:42 freenas2 GEOM_MIRROR: Device swap0: provider destroyed.
May  1 15:43:42 freenas2 GEOM_MIRROR: Device swap0 destroyed.
May  1 15:47:42 freenas2 1 2021-05-01T15:47:42.675726+03:00 freenas2.istanbul collectd 2056 - - Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 62, in read
    with Client() as c:
  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 281, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.8/site-packages/middlewared/client/client.py", line 124, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.8/site-packages/ws4py/client/__init__.py", line 223, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
May  1 15:48:58 freenas2 1 2021-05-01T15:48:58.033641+03:00 freenas2.istanbul reboot 5869 - - rebooted by root
May  1 15:50:30 freenas2 1 2021-05-01T15:50:30.154692+03:00 freenas2.istanbul shutdown 6227 - - reboot by root:
May  1 15:50:53 freenas2 proftpd[1962]: 127.0.0.1 - ProFTPD killed (signal 15)
May  1 15:50:53 freenas2 proftpd[1962]: 127.0.0.1 - ProFTPD 1.3.6b standalone mode SHUTDOWN
May  1 15:50:53 freenas2 1 2021-05-01T15:50:53.988686+03:00 freenas2.istanbul ntpd 1937 - - ntpd exiting on signal 15 (Terminated)



Code:
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Micron 5100 Pro / 5200 SSDs
Device Model:     MTFDDAK480TDN
Serial Number:    200426191BAF
LU WWN Device Id: 5 00a075 126191baf
Add. Product Id:  DELL(tm)
Firmware Version: D1DF003
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu May  6 09:46:32 2021 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         ( 1474) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  26) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002e   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9403
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       188
 13 Read_Soft_Error_Rate    0x0032   100   100   000    Old_age   Always       -       0
173 Avg_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       5
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       33
179 Used_Rsvd_Blk_Cnt_Tot   0x0033   100   100   000    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0032   000   000   000    Old_age   Always       -       9898
181 Program_Fail_Cnt_Total  0x0032   100   100   001    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   001    Old_age   Always       -       0
183 SATA_Int_Downshift_Ct   0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       42
193 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   059   037   000    Old_age   Always       -       41 (Min/Max 16/63)
195 Hardware_ECC_Recovered  0x002e   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
201 Unknown_SSD_Attribute   0x0033   100   100   001    Pre-fail  Always       -       0
202 Percent_Lifetime_Remain 0x0033   100   100   005    Pre-fail  Always       -       9659
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 RAIN_Success_Recovered  0x0032   100   100   000    Old_age   Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       9307331734
245 Unknown_Attribute       0x0030   100   100   001    Old_age   Offline      -       100
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       290855349
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       14373281

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9394         -
# 2  Short offline       Completed without error       00%      9370         -
# 3  Short offline       Completed without error       00%      9346         -
# 4  Short offline       Completed without error       00%      9322         -
# 5  Short offline       Completed without error       00%      9298         -
# 6  Offline             Completed without error       00%      9290         -
# 7  Extended offline    Completed without error       00%      9290         -
# 8  Short offline       Completed without error       00%      9290         -
# 9  Short offline       Completed without error       00%      9290         -
#10  Short offline       Completed without error       00%      9274         -
#11  Short offline       Completed without error       00%      9250         -
#12  Short offline       Completed without error       00%      9226         -
#13  Short offline       Completed without error       00%      9202         -
#14  Short offline       Completed without error       00%      9178         -
#15  Short offline       Completed without error       00%      9154         -
#16  Short offline       Completed without error       00%      9130         -
#17  Short offline       Completed without error       00%      9106         -
#18  Short offline       Completed without error       00%      9082         -
#19  Short offline       Completed without error       00%      9058         -
#20  Short offline       Completed without error       00%      9034         -
#21  Short offline       Completed without error       00%      9010         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You may have had your system dataset on that pool.

You will need to replace that disk (da12) and restore any contents from backup.

The incomplete SMART results you posted aren't much help in seeing whether SMART reports any issues; we would need to see the rest of it if you want help confirming what the logs are already saying: that da12 is dying or already dead.
 

blckhm

Dabbler
Joined
Sep 24, 2018
Messages
42
No way, because I added that SSD after the initial setup with the 12 drives. The SSD was meant as a kind of experiment.

The system worked without any problem for 6 months, until I started using that SSD under load for 2 weeks.


By the way, sorry for the trimmed SMART report. I have edited the first message, which now contains the full output.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 42
Although I also see that you have another odd attribute possibly indicating that this disk isn't correctly matched in the database (see below), this Command Timeout value would perhaps match with what you were seeing with CAM in the logs.

If that number is increasing and you continue to see CAM timeouts in dmesg, the disk is bad and needs replacement.
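
As a rough way to check whether CAM errors keep appearing, the following Python sketch counts them per device in a list of log lines. It is only an illustration: the function name `count_cam_errors` and the regular expression are my own assumptions about the log layout, based on the `/var/log/messages` excerpt in the first post, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches the FreeBSD CAM peripheral prefix, e.g. "(da12:mpr0:0:20:0): "
# (pattern is an assumption derived from the log excerpt in this thread).
CAM_LINE = re.compile(r"\((?P<dev>[a-z]+\d+):[^)]*\):\s+(?P<msg>.*)")


def count_cam_errors(lines):
    """Return (CAM error count per device, devices whose retries ran out)."""
    errors = Counter()
    exhausted = set()
    for line in lines:
        m = CAM_LINE.search(line)
        if not m:
            continue
        msg = m.group("msg")
        if msg.startswith("CAM status:"):
            errors[m.group("dev")] += 1
        elif "Retries exhausted" in msg:
            exhausted.add(m.group("dev"))
    return errors, exhausted


# Sample lines taken from the log in the first post.
sample = [
    "May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): WRITE(10). CDB: 2a 00 1a 01 ff 30 00 00 68 00",
    "May  1 10:58:40 freenas2 (da12:mpr0:0:20:0): CAM status: CCB request completed with an error",
    "May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): CAM status: SCSI Status Error",
    "May  1 10:58:54 freenas2 (da12:mpr0:0:20:0): Error 6, Retries exhausted",
]
errors, exhausted = count_cam_errors(sample)
print(dict(errors), exhausted)
```

Running it over the log at intervals and comparing the counts with the raw value of attribute 188 would show whether both are still climbing.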

202 Percent_Lifetime_Remain 0x0033 100 100 005 Pre-fail Always - 9659
Not sure how a "percent" can have this value... as mentioned above, maybe we're not looking at the right mapping of the values.
 

blckhm

Dabbler
Joined
Sep 24, 2018
Messages
42
So, is there any way to map these to the right values? After the incident I stopped using that drive immediately, so I could not track the command timeouts.

Actually, what I need to understand is why the NFS service broke on a single pool failure. Some of my storage servers have multiple pools with different layouts, and I have to worry about those servers. If a single pool failure can make the NFS service unstable, I will have to change my network sharing protocol from NFS to iSCSI, etc.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Code:
202 Percent_Lifetime_Remain 0x0033 100 100 005 Pre-fail Always - 9659 

I think the percent is 100:
0x0033 is the flags field
the first 100 is the normalized value, ranging from 100 down to 0, which matches a percent metric
disregard the second 100 (the worst value recorded so far)
005 is the threshold at which the alarm starts ringing, at 5% remaining lifetime
9659 is some vendor-specific internal metric from which the 100% is derived.
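
To make the column mapping above concrete, here is a minimal Python sketch that splits one row of the smartctl attribute table into named fields. The names `SmartAttr` and `parse_smart_row` are made up for illustration and are not part of smartmontools; the sketch assumes the ten-column layout shown in the report in the first post.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    flag: int
    value: int       # normalized value (the "percent" for attribute 202)
    worst: int       # worst normalized value recorded
    thresh: int      # failure threshold for the normalized value
    attr_type: str
    updated: str
    when_failed: str
    raw: str         # vendor-specific raw value, kept as text


def parse_smart_row(line: str) -> SmartAttr:
    """Split one row of the smartctl attribute table into named fields."""
    # The first nine columns contain no spaces; the raw value may,
    # e.g. "41 (Min/Max 16/63)", so cap the number of splits.
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"not a SMART attribute row: {line!r}")
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        flag=int(parts[2], 16),
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        attr_type=parts[6],
        updated=parts[7],
        when_failed=parts[8],
        raw=parts[9].strip(),
    )


row = "202 Percent_Lifetime_Remain 0x0033   100   100   005    Pre-fail  Always       -       9659"
attr = parse_smart_row(row)
print(attr.name, attr.value, attr.thresh, attr.raw)
```

Read this way, the drive reports 100 (normalized) against a threshold of 5, and 9659 is only the raw counter behind it.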
 