Hello All,
This is more of a 'what should I have done' type of question than a complaint.
I had a drive that reported a pending sector via smartctl. The status reported for the drive in the pool was 'null'. I replaced the drive via the GUI with a standby drive and the pool resilvered. Once that was done I clicked 'detach' and the system stopped responding. I checked the console and saw the Fatal trap 12 and the debugger prompt.
Code:
Fatal Trap 12: page fault while in kernel mode
cpuid=4; apic id = 04
fault virtual address   = 0x8
fault code              = supervisor read data, page not present
instruction point       = 0x20:0xffffffff8052e845
stack pointer           = 0x28:ffffff80000f1a30
frame pointer           = 0x28:ffffff80000f1b70
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2 (g_event)
[thread pid 2 tid 100021 ]
Stopped at      g_part_ctlreq+0x1375:   cmpq    $0x80c1a180,0x8(%r13)
smartctl data from the drive prior to system hang:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       733
  3 Spin_Up_Time            0x0027   180   169   021    Pre-fail  Always       -       3975
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       90
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   199   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2327
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       88
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   128   110   000    Old_age   Always       -       19
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1403         -
# 2  Extended offline    Completed without error       00%        46         -
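For anyone who would rather check these attributes by script than by eye, here is a minimal Python sketch that pulls the raw values out of attribute-table text like the above. The names `parse_smart_attributes`, `nonzero_watched`, and the `WATCHED` list are my own choices for illustration and are not part of smartmontools. It assumes the ten-column layout shown in my output, with the raw value starting in the last column.

```python
import re

# Attributes I chose to watch for sector problems (an assumption for this sketch).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for each attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = int(m.group(10))
    return attrs

def nonzero_watched(attrs):
    """Return only the watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}

# Three rows copied from the table above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""
print(nonzero_watched(parse_smart_attributes(sample)))
```

On the three sample rows this flags only Current_Pending_Sector, which matches what I saw on the drive.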
Unfortunately, I didn't notice when the error actually occurred; I only saw it when I logged into the GUI to do another task. I've since set up email alerts. I did grab this snippet from the logs:
Code:
Jan 25 02:11:13 freenas smartd[2936]: Device: /dev/da1 [SAT], 1 Currently unreadable (pending) sectors
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c3 b1 0 1 0 0 length 131072 SMID 716 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bd b1 0 1 0 0 length 131072 SMID 745 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 be b1 0 1 0 0 length 131072 SMID 75 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bf b1 0 1 0 0 length 131072 SMID 564 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c0 b1 0 1 0 0 length 131072 SMID 665 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel:
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c1 b1 0 1 0 0 length 131072 SMID 926 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c2 b1 0 1 0 0 length 131072 SMID 1000 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bb b1 0 1 0 0 length 131072 SMID 361 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bc b1 0 1 0 0 length 131072 SMID 782 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 ba b1 0 1 0 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): CAM status: SCSI Status Error
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): SCSI status: Check Condition
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 length 0 SMID 140 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): READ(10). CDB: 28 0 e8 e0 86 90 0 0 10 0 length 8192 SMID 170 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): READ(10). CDB: 28 0 e8 e0 84 90 0 0 10 0 length 8192 SMID 615 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:00 freenas kernel:
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): READ(10). CDB: 28 0 0 40 2 90 0 0 10 0 length 8192 SMID 421 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:39 freenas kernel: (da1:mps0:0:5:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 652 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:20:39 freenas kernel: mps0: mpssas_alloc_tm freezing simq
Jan 25 02:20:39 freenas kernel: mps0: mpssas_remove_complete on handle 0x000a, IOCStatus= 0x0
Jan 25 02:20:39 freenas kernel: mps0: mpssas_free_tm releasing simq
Jan 25 02:20:39 freenas kernel: (da1:mps0:0:5:0): lost device - 4 outstanding, 1 refs
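Since I missed these messages when they were logged, here is a rough Python sketch of how the two events I care about (the smartd pending-sector warning and the "lost device" line) could be picked out of a log with one entry per line. The function name `summarize_events` and both patterns are hypothetical; they are modelled only on the message shapes in my snippet, so other smartd or driver versions may word things differently.

```python
import re

# Patterns are guesses based only on the two message shapes in my log snippet.
PENDING = re.compile(
    r"smartd\[\d+\]: Device: (\S+) .*?(\d+) Currently unreadable \(pending\) sectors"
)
LOST = re.compile(r"kernel: \((\w+):[^)]*\): lost device")

def summarize_events(log_text):
    """Return (kind, device, count) tuples in the order they appear in the log."""
    events = []
    for line in log_text.splitlines():
        m = PENDING.search(line)
        if m:
            events.append(("pending", m.group(1), int(m.group(2))))
            continue
        m = LOST.search(line)
        if m:
            events.append(("lost", m.group(1), None))
    return events

# Three lines copied from my log.
sample_log = """\
Jan 25 02:11:13 freenas smartd[2936]: Device: /dev/da1 [SAT], 1 Currently unreadable (pending) sectors
Jan 25 02:20:39 freenas kernel: mps0: mpssas_free_tm releasing simq
Jan 25 02:20:39 freenas kernel: (da1:mps0:0:5:0): lost device - 4 outstanding, 1 refs
"""
for event in summarize_events(sample_log):
    print(event)
```

This only matches text; it does not say anything about why the device was lost.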
Would I have avoided this issue by taking the bad drive offline before doing the replace, and then detaching it if necessary? Or is something else at play in this situation?
Thanks