pool failure

killer1

Hi all,
I'm pretty sure that two of my HDDs have failed at the same time, and the pool has only one drive of redundancy (RAIDZ1). I'm a BSD/FreeNAS noob and would like someone in the know to confirm my suspicions.

# zpool import
   pool: freenas
     id: 10433434028436438315
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
         devices and try again.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-3C
 config:

        freenas                                         UNAVAIL  insufficient replicas
          raidz1-0                                      UNAVAIL  insufficient replicas
            gptid/1f87a119-f7e4-11e9-ba24-1f52f00ab5d9  UNAVAIL  cannot open
            gptid/2078d030-f7e4-11e9-ba24-1f52f00ab5d9  ONLINE
            gptid/216610dc-f7e4-11e9-ba24-1f52f00ab5d9  ONLINE
            gptid/226dc24e-f7e4-11e9-ba24-1f52f00ab5d9  UNAVAIL  cannot open
            gptid/235d6af7-f7e4-11e9-ba24-1f52f00ab5d9  ONLINE
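
As a rough sanity check of the "insufficient replicas" state, here is a minimal Python sketch that counts the members of the raidz vdev and compares the number that are not ONLINE against the parity level taken from the vdev name. The helper name `raidz_summary` and the embedded `CONFIG` text are illustrative assumptions; the sketch only handles a single raidz vdev whose members are listed as `gptid/...`, as in the output above.

```python
import re

# Device lines copied from the `zpool import` output above.
CONFIG = """\
freenas UNAVAIL insufficient replicas
raidz1-0 UNAVAIL insufficient replicas
gptid/1f87a119-f7e4-11e9-ba24-1f52f00ab5d9 UNAVAIL cannot open
gptid/2078d030-f7e4-11e9-ba24-1f52f00ab5d9 ONLINE
gptid/216610dc-f7e4-11e9-ba24-1f52f00ab5d9 ONLINE
gptid/226dc24e-f7e4-11e9-ba24-1f52f00ab5d9 UNAVAIL cannot open
gptid/235d6af7-f7e4-11e9-ba24-1f52f00ab5d9 ONLINE
"""


def raidz_summary(config_text):
    """Return (parity, total_members, missing_members) for one raidz vdev."""
    parity = None
    total = 0
    missing = 0
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        match = re.match(r"raidz(\d)-\d+$", name)
        if match:
            # raidz1 / raidz2 / raidz3: the digit is the parity level.
            parity = int(match.group(1))
        elif name.startswith("gptid/"):
            total += 1
            if state != "ONLINE":
                missing += 1
    if parity is None:
        raise ValueError("no raidz vdev found in config")
    return parity, total, missing


parity, total, missing = raidz_summary(CONFIG)
# A raidz vdev can lose at most `parity` members and still be imported.
importable = missing <= parity
print(f"raidz{parity}: {missing} of {total} members missing, importable={importable}")
```

With two members missing from a raidz1 vdev, the check comes out false, which is consistent with the UNAVAIL state reported by `zpool import`.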

/var/log/messages
Dec 27 17:28:44 freenas da2 at mps0 bus 0 scbus2 target 12 lun 0
Dec 27 17:28:44 freenas da2: <ATA WDC WD20EFAX-68F 0A82> Fixed Direct Access SPC-4 SCSI device
Dec 27 17:28:44 freenas da2: Serial Number WD-WXD1A799C1F6
Dec 27 17:28:44 freenas da2: 600.000MB/s transfers
Dec 27 17:28:44 freenas da2: Command Queueing enabled
Dec 27 17:28:44 freenas da2: 1907729MB (3907029168 512 byte sectors)
Dec 27 17:28:44 freenas 1 2022-12-27T17:28:44.562071-08:00 freenas.local smartd 5571 - - Device: /dev/da2 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Dec 27 17:29:17 freenas mps0: Controller reported scsi ioc terminated tgt 12 SMID 1118 loginfo 31080000
Dec 27 17:29:17 freenas mps0: Controller reported scsi ioc terminated tgt 12 SMID 1119 loginfo 31080000
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): WRITE(10). CDB: 2a 00 e8 e0 84 90 00 00 10 00
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): CAM status: CCB request completed with an error
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): Retrying command, 3 more tries remain
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): WRITE(10). CDB: 2a 00 e8 e0 86 90 00 00 10 00
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): CAM status: CCB request completed with an error
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): Retrying command, 3 more tries remain
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): CAM status: SCSI Status Error
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): SCSI status: Check Condition
Dec 27 17:29:17 freenas (da2:mps0:0:12:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
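
To see which part of the disk the failed writes were aimed at, the WRITE(10) CDB bytes in the log can be decoded. The sketch below assumes the standard 10-byte layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, a 2-byte big-endian transfer length in bytes 7-8) and the 512-byte logical sector size that the kernel reports for da2. The function name `decode_write10` is an illustrative assumption, not part of any tool.

```python
SECTOR_SIZE = 512  # logical sector size reported for da2 in the log above


def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks to be written.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


# CDBs copied from the three failed writes in /var/log/messages.
for cdb_hex in (
    "2a 00 e8 e0 84 90 00 00 10 00",
    "2a 00 e8 e0 86 90 00 00 10 00",
    "2a 00 00 40 02 90 00 00 10 00",
):
    lba, blocks = decode_write10(cdb_hex)
    print(f"LBA {lba} ({lba:#x}), {blocks} blocks = {blocks * SECTOR_SIZE} bytes")
```

Each of the three commands is a 16-block (8 KiB) write. Two target LBAs near the end of the 3907029168-sector disk and one targets an LBA near the start.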

If I unplug the unavailable drive and plug it back in, it shows as online briefly before becoming unavailable again. I captured this SMART log before it dropped out:

# smartctl -a /dev/da4
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red (SMR)
Device Model: WDC WD20EFAX-68FB5N0
Serial Number: WD-WXD1A799C1F6
LU WWN Device Id: 5 0014ee 211825297
Firmware Version: 82.00A82
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Dec 27 19:44:04 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 8864) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 246) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   192   192   051    Pre-fail  Always       -       321
  3 Spin_Up_Time            0x0027   171   170   021    Pre-fail  Always       -       2441
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       83
  5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 2611
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       21173
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       77
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       75
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0022   111   104   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2611
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
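
The FAILING_NOW flag follows from comparing each attribute's normalized VALUE with its THRESH: an attribute counts as failed when VALUE is less than or equal to THRESH. The minimal Python sketch below applies that rule to the rows above. The `failing_attributes` helper and the embedded `TABLE` text are illustrative assumptions, and the parser only relies on whitespace-separated columns in the order smartctl prints them.

```python
# Attribute rows copied from the smartctl -a output above.
TABLE = """\
1 Raw_Read_Error_Rate 0x002f 192 192 051 Pre-fail Always - 321
3 Spin_Up_Time 0x0027 171 170 021 Pre-fail Always - 2441
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 83
5 Reallocated_Sector_Ct 0x0033 133 133 140 Pre-fail Always FAILING_NOW 2611
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 071 071 000 Old_age Always - 21173
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 77
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 75
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 17
194 Temperature_Celsius 0x0022 111 104 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 2611
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
"""


def failing_attributes(table_text):
    """Return (id, name, value, thresh) for rows where VALUE <= THRESH."""
    failing = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue
        attr_id = int(fields[0])
        name = fields[1]
        value = int(fields[3])   # normalized current value
        thresh = int(fields[5])  # vendor failure threshold
        if value <= thresh:
            failing.append((attr_id, name, value, thresh))
    return failing


for attr_id, name, value, thresh in failing_attributes(TABLE):
    print(f"{attr_id} {name}: VALUE {value} <= THRESH {thresh}")
```

Only attribute 5 (Reallocated_Sector_Ct) meets the condition, with a normalized value of 133 against a threshold of 140, matching the FAILING_NOW entry and the failed overall-health assessment.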

SMART Error Log Version: 1
ATA Error Count: 5
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 21170 hours (882 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 0c 00 00 00 00 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 03 0c 00 00 00 00 00 00:00:16.600 SET FEATURES [Set transfer mode]
e5 00 00 00 00 00 00 00 00:00:16.600 CHECK POWER MODE
ec 00 00 00 00 00 00 00 00:00:16.600 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:15.364 IDENTIFY DEVICE

Error 4 occurred at disk power-on lifetime: 21170 hours (882 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 0c 00 00 00 00 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 03 0c 00 00 00 00 00 00:00:17.838 SET FEATURES [Set transfer mode]
e5 00 00 00 00 00 00 00 00:00:17.838 CHECK POWER MODE
ec 00 00 00 00 00 00 00 00:00:17.838 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:16.273 IDENTIFY DEVICE

Error 3 occurred at disk power-on lifetime: 21170 hours (882 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 0c 00 00 00 00 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 03 0c 00 00 00 00 00 00:01:21.316 SET FEATURES [Set transfer mode]
e5 00 00 00 00 00 00 00 00:01:21.316 CHECK POWER MODE
ec 00 00 00 00 00 00 00 00:01:21.316 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:01:19.766 IDENTIFY DEVICE

Error 2 occurred at disk power-on lifetime: 21170 hours (882 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 0c 00 00 00 00 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 03 0c 00 00 00 00 00 00:06:41.255 SET FEATURES [Set transfer mode]
e5 00 00 00 00 00 00 00 00:06:41.255 CHECK POWER MODE
ec 00 00 00 00 00 00 00 00:06:41.250 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:06:39.592 IDENTIFY DEVICE

Error 1 occurred at disk power-on lifetime: 21169 hours (882 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 61 0c 00 00 00 00 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 03 0c 00 00 00 00 00 00:06:50.509 SET FEATURES [Set transfer mode]
e5 00 00 00 00 00 00 00 00:06:50.509 CHECK POWER MODE
ec 00 00 00 00 00 00 00 00:06:50.498 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:06:48.826 IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: unknown failure 90% 20953 -
# 2 Short offline Completed: unknown failure 90% 20785 -
# 3 Short offline Completed: unknown failure 90% 20617 -
# 4 Short offline Completed without error 00% 20449 -
# 5 Short offline Completed without error 00% 20104 -
# 6 Short offline Completed without error 00% 19951 -
# 7 Short offline Completed without error 00% 19783 -
# 8 Short offline Completed without error 00% 19615 -
# 9 Short offline Completed without error 00% 19448 -
#10 Short offline Completed without error 00% 19280 -
#11 Short offline Completed without error 00% 19112 -
#12 Short offline Completed without error 00% 18944 -
#13 Short offline Completed without error 00% 18776 -
#14 Short offline Completed without error 00% 18608 -
#15 Short offline Completed without error 00% 18440 -
#16 Short offline Completed without error 00% 18273 -
#17 Short offline Completed without error 00% 18105 -
#18 Short offline Completed without error 00% 17937 -
#19 Short offline Completed without error 00% 17769 -
#20 Short offline Completed without error 00% 17601 -
#21 Short offline Completed without error 00% 17433 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 